Represent integers as bit vectors, e.g. a set of flags.
Parameters:
values (pdarray, int64) – The integers to represent as bit vectors
width (int) – The number of bit fields in the vector
reverse (bool) – If True, display bits from least significant (left) to most
significant (right). By default, the most significant bit
is the left-most bit.
This class is a thin wrapper around pdarray that mostly affects
how values are displayed to the user. Operators and methods will
typically treat this class like a uint64 pdarray.
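The display behavior can be modeled in plain Python. The sketch below (hypothetical helper names, Arkouda not required) shows how a `width` and a `reverse` flag determine the rendered bit string:

```python
def format_bits(value: int, width: int = 64, reverse: bool = False) -> str:
    # Mask to the requested width, then render most-significant bit first.
    bits = format(value & ((1 << width) - 1), f"0{width}b")
    # reverse=True puts the least significant bit on the left instead.
    return bits[::-1] if reverse else bits

print(format_bits(5, width=4))                # 0101
print(format_bits(5, width=4, reverse=True))  # 1010
```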
Register this BitVector object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the BitVector is to be registered under,
this will be the root name for underlying components
Returns:
The same BitVector which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different BitVectors with the same name.
Make a callback (i.e. function) that can be called on an
array to create a BitVector.
Parameters:
width (int) – The number of bit fields in the vector
reverse (bool) – If True, display bits from least significant (left) to most
significant (right). By default, the most significant bit
is the left-most bit.
Returns:
bitvectorizer – A function that takes an array and returns a BitVector instance
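A minimal model of such a factory, using a closure to freeze the display settings (plain Python, illustrative only; a real implementation would return a BitVector rather than rendered strings):

```python
def make_bitvectorizer(width: int = 64, reverse: bool = False):
    # The returned callback carries width/reverse with it, so it can be
    # applied uniformly to many arrays (here modeled as lists of ints).
    def bitvectorizer(values):
        rendered = [format(v & ((1 << width) - 1), f"0{width}b") for v in values]
        return [b[::-1] for b in rendered] if reverse else rendered
    return bitvectorizer

flags = make_bitvectorizer(width=3)
print(flags([1, 2, 4]))  # ['001', '010', '100']
```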
Custom property-like object.
A descriptor for caching accessors.
:param name: Namespace that will be accessed under, e.g. df.foo.
:type name: str
:param accessor: Class with the extension methods.
:type accessor: cls
Notes
For accessor, the class’s __init__ method assumes that it receives one of
Series, DataFrame or Index as the
single argument data.
initialdata (List or dictionary of lists, tuples, or pdarrays) – Each list/dictionary entry corresponds to one column of the data and
should be a homogeneous type. Different columns may have different
types. If using a dictionary, keys should be strings.
index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.
columns (List, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must
be strings. Defaults to a stringified integer range.
Examples
Create an empty DataFrame and add a column of data:
Group the dataframe by a column or a list of columns.
Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=False) – If True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise an arkouda.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, groupby columns will be set as index
otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Returns:
If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise returns an arkouda.groupbyclass.GroupBy object.
Concatenate data from ‘other’ onto the end of this DataFrame, in place.
Explicitly, use the arkouda concatenate function to append the data
from each column in other to the end of self. This operation is done
in place, in the sense that the underlying pdarrays are updated from
the result of the arkouda concatenate function, rather than returning
a new DataFrame object containing the result.
Parameters:
other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.
ordered (bool, default=True) – If False, allow rows to be interleaved for better performance (but
data within a row remains together). By default, append all rows
to the end, in input order.
Returns:
Appending occurs in-place, but result is returned for compatibility.
Apply a permutation to an entire DataFrame. The operation is done in
place and the original DataFrame will be modified.
This may be useful if you want to unsort a DataFrame, or even to
apply an arbitrary permutation such as the inverse of a sorting
permutation.
Parameters:
perm (pdarray) – A permutation array. Should be the same size as the data
arrays, and should consist of the integers [0,size-1] in
some order. Very minimal testing is done to ensure this
is a permutation.
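The in-place semantics can be sketched on a dict-of-lists stand-in for a DataFrame (hypothetical helper, not the Arkouda implementation):

```python
def apply_permutation(frame: dict, perm: list) -> None:
    # Minimal validity check, mirroring the docstring's caveat.
    n = len(perm)
    if sorted(perm) != list(range(n)):
        raise ValueError("perm must contain each of 0..size-1 exactly once")
    for name, col in frame.items():
        # New row i takes its value from old row perm[i]; columns stay aligned.
        frame[name] = [col[i] for i in perm]

df = {"a": [10, 20, 30], "b": ["x", "y", "z"]}
apply_permutation(df, [2, 0, 1])
print(df)  # {'a': [30, 10, 20], 'b': ['z', 'x', 'y']}
```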
Returns a new object with all original columns in addition to new ones.
Existing columns that are re-assigned will be overwritten.
Parameters:
**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are
callable, they are computed on the DataFrame and
assigned to the new columns. The callable must not
change input DataFrame (though pandas doesn’t check it).
If the values are not callable, (e.g. a Series, scalar, or array),
they are simply assigned.
Returns:
A new DataFrame with the new columns in addition to
all the existing columns.
Assigning multiple columns within the same assign is possible.
Later items in ‘**kwargs’ may refer to newly created or modified
columns in ‘df’; items are computed and assigned into ‘df’ in order.
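The ordering guarantee matters: a later keyword argument may build on a column created by an earlier one. A sketch of these semantics on a dict-of-lists frame (stdlib only, not Arkouda's implementation):

```python
def assign(frame: dict, **kwargs) -> dict:
    out = {k: list(v) for k, v in frame.items()}  # copy; the input frame is untouched
    for name, value in kwargs.items():
        # Callables see columns created by earlier kwargs, in order.
        out[name] = value(out) if callable(value) else list(value)
    return out

df = {"temp_c": [0.0, 25.0]}
res = assign(
    df,
    temp_f=lambda d: [c * 9 / 5 + 32 for c in d["temp_c"]],
    temp_k=lambda d: [(f - 32) * 5 / 9 + 273.15 for f in d["temp_f"]],  # uses temp_f above
)
print(res["temp_f"])  # [32.0, 77.0]
```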
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column.
If 1 or ‘columns’ counts are generated for each row.
numeric_only (bool = False) – Include only float, int or boolean data.
Returns:
For each column/row the number of non-NA/null entries.
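A sketch of the counting rule on a dict-of-lists frame (stdlib only; None and NaN stand in for missing values):

```python
import math

def count(frame: dict, axis=0):
    def present(v):
        return v is not None and not (isinstance(v, float) and math.isnan(v))
    if axis in (0, "index"):
        # One count per column.
        return {name: sum(present(v) for v in col) for name, col in frame.items()}
    # axis 1/'columns': one count per row.
    return [sum(present(v) for v in row) for row in zip(*frame.values())]

df = {"a": [1.0, float("nan"), 3.0], "b": [None, "y", "z"]}
print(count(df))          # {'a': 2, 'b': 2}
print(count(df, axis=1))  # [1, 1, 2]
```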
Group the dataframe by a column or a list of columns. Alias for GroupBy.
Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=True) – If True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise an arkouda.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, groupby columns will be set as index
otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Returns:
If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise returns an arkouda.groupbyclass.GroupBy object.
When values is a pdarray, check every value in the DataFrame to determine if
it exists in values.
>>> df.isin(ak.array([0,1]))
   col_A  col_B
0      0      1
1      0      0
When values is a dict, the values in the dict are passed to check the column
indicated by the key.
>>> df.isin({'col_A':ak.array([0,3])})
   col_A  col_B
0      0      0
1      1      0
When values is a Series, each column is checked if values is present positionally.
This means that for True to be returned, the indexes must be the same.
Return a boolean same-sized object indicating if the values are NA.
numpy.NaN values get mapped to True values.
Everything else gets mapped to False values.
Returns:
Mask of bool values for each element in DataFrame
that indicates whether an element is an NA value.
The memory usage can optionally include the contribution of
the index.
Parameters:
index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s
index in returned Series. If index=True, the memory usage of
the index is the first item in the output.
unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.
Returns:
A Series whose index is the original column names and whose values
are the memory usage of each column in bytes.
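The unit scaling can be sketched as a simple byte-count conversion (assuming binary, 1024-based units; hypothetical helper, illustrative only):

```python
def to_unit(nbytes: int, unit: str = "B") -> float:
    # Assumes binary units: 1 KB = 1024 B, and so on.
    factors = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
    if unit not in factors:
        raise ValueError(f"unit must be one of {sorted(factors)}")
    return nbytes / factors[unit]

print(to_unit(8_388_608, "MB"))  # 8.0
```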
Merge Arkouda DataFrames with a database-style join.
The resulting dataframe contains rows from both DataFrames as specified by
the merge condition (based on the “how” and “on” parameters).
right (DataFrame) – The Right DataFrame to be joined.
on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on.
If on is None, this defaults to the intersection of the columns in both DataFrames.
how ({"inner", "left", "right"}, default = "inner") – The merge condition.
Must be “inner”, “left”, or “right”.
left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping
column names in both left and right. Defaults to “_x”. Only used when how is “inner”.
right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping
column names in both left and right. Defaults to “_y”. Only used when how is “inner”.
convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64.
This is to match pandas.
If False, do not convert the column dtypes.
This has no effect when how = “inner”.
sort (bool = True) – If True, DataFrame is returned sorted by “on”.
Otherwise, the DataFrame is not sorted.
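How the suffixes resolve overlapping column names in an inner merge can be sketched as follows (hypothetical helper, not Arkouda's code):

```python
def resolve_names(left_cols, right_cols, on, left_suffix="_x", right_suffix="_y"):
    # Key columns keep their names; only overlapping non-key columns get suffixed.
    overlap = (set(left_cols) & set(right_cols)) - set(on)
    left_out = [c + left_suffix if c in overlap else c for c in left_cols]
    right_out = [c + right_suffix if c in overlap else c for c in right_cols]
    return left_out, right_out

print(resolve_names(["id", "val"], ["id", "val"], on=["id"]))
# (['id', 'val_x'], ['id', 'val_y'])
```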
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to ‘strict’.
Read the columns of a CSV file into an Arkouda DataFrame.
If the file contains the appropriately formatted header, typed data will be returned.
Otherwise, all data will be returned as Strings objects.
Parameters:
filename (str) – Filename to read data from.
col_delim (str, default=",") – The delimiter for columns within the data.
Returns:
Arkouda DataFrame containing the columns from the CSV file.
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
Register this DataFrame object and underlying components with the Arkouda server.
Parameters:
user_defined_name (str) – User defined name the DataFrame is to be registered under.
This will be the root name for underlying components.
Returns:
The same DataFrame which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different DataFrames with the same name.
mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values.
Nonexistent names will not raise an error.
Uses the value of axis to determine whether to rename the column or index.
column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to
new column names. Nonexistent names will not raise an
error.
When this is set, axis is ignored.
index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to
new index names. Nonexistent names will not raise an
error.
When this is set, axis is ignored.
axis (int or str, default=0) – Indicates which axis to perform the rename.
0/”index” - Indexes
1/”column” - Columns
inplace (bool, default=False) – When True, perform the operation on the calling object.
When False, return a new object.
Returns:
DataFrame when inplace=False;
None when inplace=True.
Useful if this dataframe is the result of a slice operation from
another dataframe, or if you have permuted the rows and no longer need
to keep that ordering on the rows.
Parameters:
size (int, optional) – If size is passed, do not attempt to determine size based on
existing column sizes. Assume caller handles consistency correctly.
inplace (bool, default=False) – When True, perform the operation on the calling object.
When False, return a new object.
Returns:
DataFrame when inplace=False;
None when inplace=True.
Writes DataFrame to CSV file(s). Each file will contain a column for each column in the DataFrame.
All CSV Files written by Arkouda include a header denoting data types of the columns.
Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing
bytes as uint(8).
Parameters:
path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
index (bool, default=False) – If True, the index of the DataFrame will be written to the file
as a column.
columns (list of str (Optional)) – Column names to assign when writing data.
col_delim (str, default=",") – Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool, default=False) – If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Return type:
None
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline ("\n") at this time.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection,
e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec,
e.g., starting “s3://”, “gcs://”.
An error will be raised if providing this argument with a non-fsspec URL.
See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
**kwargs – These parameters will be passed to tabulate.
datalimit (int, default=arkouda.client.maxTransferBytes) – The maximum size, in megabytes, to transfer. The requested
DataFrame will be converted to a pandas DataFrame only if the
estimated size of the DataFrame does not exceed this value.
retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want
to keep the index column, set this to True.
Returns:
The result of converting this DataFrame to a pandas DataFrame.
Save DataFrame to disk as parquet, preserving column names.
Parameters:
path (str) – File path to save data.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (list) – List of columns to include in the file. If None, writes out all columns.
compression (str (Optional), default=None) – Provide the compression type to use when writing the file.
Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool, default=False) – Parquet requires all columns to be the same size and Categoricals
don’t satisfy that requirement.
If set, write the equivalent Strings in place of any Categorical columns.
Return type:
None
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
This method saves one file per locale of the arkouda server. All
files are prefixed by the path argument and suffixed by their
locale number.
hostname (str) – The hostname where the Arkouda server intended to
receive the DataFrame is running.
port (int_scalars) – The port over which to send the array. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open numLocales ports in
succession, so it will use ports in the
range {port..(port+numLocales)} (e.g., when an
Arkouda server of 4 nodes is running and port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Returns:
A message indicating a complete transfer.
Return type:
str
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
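The port arithmetic described above reduces to a simple range, one port per locale (sketch; the numbers come from the example in the parameter description):

```python
def transfer_ports(port: int, num_locales: int) -> list:
    # One port per locale, starting at the given port.
    return list(range(port, port + num_locales))

print(transfer_ports(1234, 4))  # [1234, 1235, 1236, 1237]
```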
Overwrite the dataset with the provided name using this DataFrame. If
the dataset does not exist, it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (List, default=None) – List of columns to include in the file. If None, writes out all columns.
repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to False will yield better performance, but will cause
file sizes to expand.
Returns:
Success message if successful.
Return type:
str
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray.
Notes
If the file does not contain a File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added.
initialdata (List or dictionary of lists, tuples, or pdarrays) – Each list/dictionary entry corresponds to one column of the data and
should be a homogenous type. Different columns may have different
types. If using a dictionary, keys should be strings.
index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.
columns (List, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must
be strings. Defaults to an stringified integer range.
Examples
Create an empty DataFrame and add a column of data:
Group the dataframe by a column or a list of columns.
Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=False) – If True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise an arkouda.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, groupby columns will be set as index
otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Returns:
If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise returns an arkouda.groupbyclass.GroupBy object.
Concatenate data from ‘other’ onto the end of this DataFrame, in place.
Explicitly, use the arkouda concatenate function to append the data
from each column in other to the end of self. This operation is done
in place, in the sense that the underlying pdarrays are updated from
the result of the arkouda concatenate function, rather than returning
a new DataFrame object containing the result.
Parameters:
other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.
ordered (bool, default=True) – If False, allow rows to be interleaved for better performance (but
data within a row remains together). By default, append all rows
to the end, in input order.
Returns:
Appending occurs in-place, but result is returned for compatibility.
Apply a permutation to an entire DataFrame. The operation is done in
place and the original DataFrame will be modified.
This may be useful if you want to unsort an DataFrame, or even to
apply an arbitrary permutation such as the inverse of a sorting
permutation.
Parameters:
perm (pdarray) – A permutation array. Should be the same size as the data
arrays, and should consist of the integers [0,size-1] in
some order. Very minimal testing is done to ensure this
is a permutation.
Returns a new object with all original columns in addition to new ones.
Existing columns that are re-assigned will be overwritten.
Parameters:
**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are
callable, they are computed on the DataFrame and
assigned to the new columns. The callable must not
change input DataFrame (though pandas doesn’t check it).
If the values are not callable, (e.g. a Series, scalar, or array),
they are simply assigned.
Returns:
A new DataFrame with the new columns in addition to
all the existing columns.
Assigning multiple columns within the same assign is possible.
Later items in ‘**kwargs’ may refer to newly created or modified
columns in ‘df’; items are computed and assigned into ‘df’ in order.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column.
If 1 or ‘columns’ counts are generated for each row.
numeric_only (bool = False) – Include only float, int or boolean data.
Returns:
For each column/row the number of non-NA/null entries.
Group the dataframe by a column or a list of columns. Alias for GroupBy.
Parameters:
keys (str or list of str) – An (ordered) list of column names or a single string to group by.
use_series (bool, default=True) – If True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise an arkouda.groupbyclass.GroupBy object.
as_index (bool, default=True) – If True, groupby columns will be set as index
otherwise, the groupby columns will be treated as DataFrame columns.
dropna (bool, default=True) – If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Returns:
If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object.
Otherwise returns an arkouda.groupbyclass.GroupBy object.
When values is a pdarray, check every value in the DataFrame to determine if
it exists in values.
>>> df.isin(ak.array([0,1]))
col_A
col_B
0
0
1
1
0
0
When values is a dict, the values in the dict are passed to check the column
indicated by the key.
>>> df.isin({'col_A':ak.array([0,3])})
col_A
col_B
0
0
0
1
1
0
When values is a Series, each column is checked if values is present positionally.
This means that for True to be returned, the indexes must be the same.
Return a boolean same-sized object indicating if the values are NA.
numpy.NaN values get mapped to True values.
Everything else gets mapped to False values.
Returns:
Mask of bool values for each element in DataFrame
that indicates whether an element is an NA value.
The memory usage can optionally include the contribution of
the index.
Parameters:
index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s
index in returned Series. If index=True, the memory usage of
the index is the first item in the output.
unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.
Returns:
A Series whose index is the original column names and whose values
is the memory usage of each column in bytes.
Merge Arkouda DataFrames with a database-style join.
The resulting dataframe contains rows from both DataFrames as specified by
the merge condition (based on the “how” and “on” parameters).
right (DataFrame) – The Right DataFrame to be joined.
on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on.
If on is None, this defaults to the intersection of the columns in both DataFrames.
how ({"inner", "left", "right}, default = "inner") – The merge condition.
Must be “inner”, “left”, or “right”.
left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping
column names in both left and right. Defaults to “_x”. Only used when how is “inner”.
right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping
column names in both left and right. Defaults to “_y”. Only used when how is “inner”.
convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64.
This is to match pandas.
If False, do not convert the column dtypes.
This has no effect when how = “inner”.
sort (bool = True) – If True, DataFrame is returned sorted by “on”.
Otherwise, the DataFrame is not sorted.
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to ‘strict’.
Read the columns of a CSV file into an Arkouda DataFrame.
If the file contains the appropriately formatted header, typed data will be returned.
Otherwise, all data will be returned as a Strings objects.
Parameters:
filename (str) – Filename to read data from.
col_delim (str, default=",") – The delimiter for columns within the data.
Returns:
Arkouda DataFrame containing the columns from the CSV file.
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
Register this DataFrame object and underlying components with the Arkouda server.
Parameters:
user_defined_name (str) – User defined name the DataFrame is to be registered under.
This will be the root name for underlying components.
Returns:
The same DataFrame which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different DataFrames with the same name.
mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values.
Nonexistent names will not raise an error.
Uses the value of axis to determine if renaming column or index
column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to
new column names. Nonexistent names will not raise an
error.
When this is set, axis is ignored.
index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to
new index names. Nonexistent names will not raise an
error.
When this is set, axis is ignored.
axis (int or str, default=0) – Indicates which axis to perform the rename.
0/”index” - Indexes
1/”column” - Columns
inplace (bool, default=False) – When True, perform the operation on the calling object.
When False, return a new object.
Returns:
DateFrame when inplace=False;
None when inplace=True.
Useful if this dataframe is the result of a slice operation from
another dataframe, or if you have permuted the rows and no longer need
to keep that ordering on the rows.
Parameters:
size (int, optional) – If size is passed, do not attempt to determine size based on
existing column sizes. Assume caller handles consistency correctly.
inplace (bool, default=False) – When True, perform the operation on the calling object.
When False, return a new object.
Returns:
DateFrame when inplace=False;
None when inplace=True.
Writes DataFrame to CSV file(s). File will contain a column for each column in the DataFrame.
All CSV Files written by Arkouda include a header denoting data types of the columns.
Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing
bytes as uint(8).
Parameters:
path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
index (bool, default=False) – If True, the index of the DataFrame will be written to the file
as a column.
columns (list of str (Optional)) – Column names to assign when writing data.
col_delim (str, default=",") – Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool, default=False) – If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Return type:
None
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist.
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server.
Notes
CSV format is not currently supported by load/load_all operations.
The column delimiter is expected to be the same for column names and data.
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (”\n”) at this time.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection,
e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec,
e.g., starting “s3://”, “gcs://”.
An error will be raised if providing this argument with a non-fsspec URL.
See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
**kwargs – These parameters will be passed to tabulate.
datalimit (int, default=arkouda.client.maxTransferBytes) – The maximum number size, in megabytes to transfer. The requested
DataFrame will be converted to a pandas DataFrame only if the
estimated size of the DataFrame does not exceed this value.
retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want
to keep the index column, set this to True.
Returns:
The result of converting this DataFrame to a pandas DataFrame.
Save DataFrame to disk as parquet, preserving column names.
Parameters:
path (str) – File path to save data.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (list) – List of columns to include in the file. If None, writes out all columns.
compression (str (Optional), default=None) – Provide the compression type to use when writing the file.
Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool, default=False) – Parquet requires all columns to be the same size and Categoricals
don’t satisfy that requirement.
If set, write the equivalent Strings in place of any Categorical columns.
Return type:
None
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
This method saves one file per locale of the arkouda server. All
files are prefixed by the path argument and suffixed by their
locale number.
hostname (str) – The hostname where the Arkouda server intended to
receive the DataFrame is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port much match the port passed to the call to
ak.receive_array().
Returns:
A message indicating a complete transfer.
Return type:
str
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Overwrite the dataset with the name provided with this dataframe. If
the dataset does not exist it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share.
index (bool, default=False) – If True, save the index column. By default, do not save the index.
columns (List, default=None) – List of columns to include in the file. If None, writes out all columns.
repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to False will yield better performance, but will cause
file sizes to expand.
Returns:
Success message if successful.
Return type:
str
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray.
Notes
If the file does not contain a File_Format attribute indicating how it was saved,
the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
A DataFrame that has been grouped by a subset of columns.
Parameters:
gb_key_names (str or list(str), default=None) – The column name(s) associated with the aggregated columns.
as_index (bool, default=True) – If True, interpret aggregated column as index
(only implemented for single dimensional aggregates).
Otherwise, treat aggregated column as a dataframe column.
GroupBy object, where the aggregation keys are values of column(s) of a dataframe,
usually in preparation for aggregating with respect to the other columns.
x (Series or pdarray) – The values to put in each group’s segment.
permute (bool, default=True) – If True (default), permute broadcast values back to the
ordering of the original array on which GroupBy was called.
If False, the broadcast values are grouped by value.
Returns:
A Series with the Index of the original frame and the values of the broadcast.
n (int, optional, default = 5) – Maximum number of rows to return for each group.
If the number of rows in a group is less than n,
all the values from that group will be returned.
sort_index (bool, default = True) – If True, return the DataFrame with indices sorted.
Return a random sample from each group. You can specify either the number of elements
or the fraction of elements to be sampled. random_state can be used for reproducibility.
Parameters:
n (int, optional) – Number of items to return for each group.
Cannot be used with frac and must be no larger than
the smallest group unless replace is True.
Default is one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the same row more than once.
weights (pdarray, optional) – Default None results in equal probability weighting.
If passed a pdarray, then values must have the same length as the underlying DataFrame
and will be used as sampling probabilities after normalization within each group.
Weights must be non-negative with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for random number generator.
If ak.random.Generator, use as given.
Returns:
A new DataFrame containing items randomly sampled from each group
sorted according to the grouped columns.
n (int, optional, default = 5) – Maximum number of rows to return for each group.
If the number of rows in a group is less than n,
all the rows from that group will be returned.
sort_index (bool, default = True) – If True, return the DataFrame with indices sorted.
DataSources can be local files or remote files/URLs. The files may
also be compressed or uncompressed. DataSource hides some of the
low-level details of downloading the file, allowing you to simply pass
in a valid file path (or URL) and obtain a file object.
Parameters:
destpath (str or None, optional) – Path to the directory where the source file gets downloaded to for
use. If destpath is None, a temporary directory will be created.
The default path is the current directory.
Notes
URLs require a scheme string (e.g. http://); without one they
will fail.
Return absolute path of file in the DataSource directory.
If path is a URL, abspath will return either the location where
the file exists locally or the location it would occupy when opened
using the open method.
Parameters:
path (str) – Can be a local file or a remote URL.
Returns:
out – Complete path, including the DataSource destination directory.
a remote URL that has been downloaded and stored locally in the
DataSource directory.
a remote URL that has not been downloaded, but is valid and
accessible.
Parameters:
path (str) – Can be a local file or a remote URL.
Returns:
out – True if path exists.
Return type:
bool
Notes
When path is a URL, exists will return True if it’s either
stored locally in the DataSource directory, or is a valid remote
URL. DataSource does not discriminate between the two; the file
is accessible if it exists in either location.
If path is a URL, it will be downloaded, stored in the
DataSource directory, and opened from there.
Parameters:
path (str) – Local file path or URL to open.
mode ({'r', 'w', 'a'}, optional) – Mode to open path. Mode ‘r’ for reading, ‘w’ for writing,
‘a’ to append. Available modes depend on the type of object
specified by path. Default is ‘r’.
encoding ({None, str}, optional) – Open text file with given encoding. The default encoding will be
what io.open uses.
newline ({None, str}, optional) – Newline to use when reading text file.
Datetime is the Arkouda analog to pandas DatetimeIndex and
other timeseries data types.
Parameters:
pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array)
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas
and numpy arrays, which carry their own unit. Not case-sensitive;
prefixes of full names (like ‘sec’) are accepted.
Possible values:
’weeks’ or ‘w’
’days’ or ‘d’
’hours’ or ‘h’
’minutes’, ‘m’, or ‘t’
’seconds’ or ‘s’
’milliseconds’, ‘ms’, or ‘l’
’microseconds’, ‘us’, or ‘u’
’nanoseconds’, ‘ns’, or ‘n’
Unlike in pandas, units cannot be combined or mixed with integers.
Notes
The .values attribute is always in nanoseconds with int64 dtype.
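A pure-Python sketch of the unit-alias handling described above; the helper name to_nanoseconds is hypothetical and not part of the Arkouda API, but the aliases and prefix matching follow the table above:

```python
# Hypothetical helper illustrating the unit table above (not Arkouda API).
_NS_PER_UNIT = {
    "w": 7 * 24 * 3600 * 10**9,  # weeks
    "d": 24 * 3600 * 10**9,      # days
    "h": 3600 * 10**9,           # hours
    "m": 60 * 10**9,             # minutes (alias 't')
    "s": 10**9,                  # seconds
    "ms": 10**6,                 # milliseconds (alias 'l')
    "us": 10**3,                 # microseconds (alias 'u')
    "ns": 1,                     # nanoseconds (alias 'n')
}

def to_nanoseconds(value: int, unit: str) -> int:
    """Convert an integer timestamp in `unit` to nanoseconds."""
    unit = unit.lower()
    aliases = {"t": "m", "l": "ms", "u": "us", "n": "ns"}
    full = {"weeks": "w", "days": "d", "hours": "h", "minutes": "m",
            "seconds": "s", "milliseconds": "ms", "microseconds": "us",
            "nanoseconds": "ns"}
    if unit in _NS_PER_UNIT:
        key = unit
    elif unit in aliases:
        key = aliases[unit]
    else:
        # Accept unambiguous prefixes of full names, e.g. 'sec' -> 's'.
        matches = [v for k, v in full.items() if k.startswith(unit)]
        if len(matches) != 1:
            raise ValueError(f"ambiguous or unknown unit: {unit!r}")
        key = matches[0]
    return value * _NS_PER_UNIT[key]
```

For example, to_nanoseconds(2, "sec") resolves the prefix "sec" to seconds, matching the prefix acceptance described above.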
Register this Datetime object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the Datetime is to be registered under,
this will be the root name for underlying components
Returns:
The same Datetime which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different Datetimes with the same name.
The bool_ type is not a subclass of the int_ type
(the bool_ is not even a number type). This is different
from Python’s default implementation of bool as a
subclass of int.
An integer-backed representation of a set of named binary fields, e.g. flags.
Parameters:
values (pdarray or Strings) – The array of field values. If (u)int64, the values are used as-is for the
binary representation of fields. If Strings, the values are converted
to binary according to the mapping defined by the names and MSB_left
arguments.
names (str or sequence of str) – The names of the fields, in order. A string will be treated as a list
of single-character field names. Multi-character field names are allowed,
but must be passed as a list or tuple, and the user must specify a separator.
MSB_left (bool) – Controls how field names are mapped to binary values. If True (default),
the left-most field name corresponds to the most significant bit in the
binary representation. If False, the left-most field name corresponds to
the least significant bit.
pad (str) – Character to display when field is not present. Use empty string if no
padding is desired.
separator (str) – Substring that separates fields. Used to parse input values (if ak.Strings)
and to display output.
show_int (bool) – If True (default), display the integer value of the binary fields in output.
This class is a thin wrapper around pdarray that mostly affects
how values are displayed to the user. Operators and methods will
typically treat this class like an int64 pdarray.
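A pure-Python sketch of the field-to-integer mapping that pad and MSB_left control; the helper flags_to_int is hypothetical, and the real class operates on pdarrays server-side:

```python
# Hypothetical helper illustrating the Fields mapping (not Arkouda API).
def flags_to_int(value: str, MSB_left: bool = True, pad: str = "-") -> int:
    """Each position holds a field name if set, or `pad` if unset."""
    bits = [c != pad for c in value]
    if MSB_left:
        bits = bits[::-1]  # put the least-significant field first
    return sum(1 << i for i, b in enumerate(bits) if b)

# With field names "ABCD" and MSB_left=True, "A--D" sets the highest
# and lowest bits: 0b1001 == 9.
result = flags_to_int("A--D")
```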
Generator exposes a number of methods for generating random
numbers drawn from a variety of probability distributions. In addition to
the distribution-specific arguments, each method takes a keyword argument
size that defaults to None. If size is None, then a single
value is generated and returned. If size is an integer, then a 1-D
array filled with generated values is returned.
Parameters:
seed (int) – Seed to allow for reproducible random number generation.
name_dict (dict) – Dictionary mapping the server side names associated with
the generators for each dtype.
state (int) – The current state we are in the random number generation stream.
This information makes it so calls to any dtype generator
function affects the stream of random numbers for the other generators.
This mimics the behavior we see in numpy
Its probability density function is
\[f\left(x; \frac{1}{\beta}\right) = \frac{1}{\beta} e^{-\frac{x}{\beta}}\]
for x>0 and 0 elsewhere. \(\beta\) is the scale parameter,
which is the inverse of the rate parameter \(\lambda = 1/\beta\).
The rate parameter is an alternative, widely used parameterization
of the exponential distribution.
Parameters:
scale (float or pdarray) – The scale parameter, \(\beta = 1/\lambda\). Must be
non-negative. An array must have the same size as the size argument.
size (numeric_scalars, optional) – Output shape. Default is None, in which case a single value is returned.
method (str, optional) – Either ‘inv’ or ‘zig’. ‘inv’ uses the default inverse CDF method.
‘zig’ uses the Ziggurat method.
Returns:
Drawn samples from the parameterized exponential distribution.
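As noted above, Arkouda's Generator mimics the behavior of numpy; the analogous NumPy call is a useful sketch of the scale parameterization. This is illustrative only and does not exercise the Arkouda server path:

```python
import numpy as np

# Illustrative NumPy analogue: scale is beta = 1/lambda.
rng = np.random.default_rng(seed=42)
beta = 2.0
samples = rng.exponential(scale=beta, size=1000)

# Draws are non-negative, and the sample mean approaches beta as size grows.
assert samples.min() >= 0.0
```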
Return random integers from low (inclusive) to high (exclusive),
or if endpoint=True, low (inclusive) to high (inclusive).
Return random integers from the “discrete uniform” distribution of the specified dtype.
If high is None (the default), then results are from 0 to low.
Parameters:
low (numeric_scalars) – Lowest (signed) integers to be drawn from the distribution (unless high=None,
in which case this parameter is 0 and this value is used for high).
high (numeric_scalars) – If provided, one above the largest (signed) integer to be drawn from the distribution
(see above for behavior if high=None)
size (numeric_scalars) – Output shape. Default is None, in which case a single value is returned.
dtype (dtype, optional) – Desired dtype of the result. The default value is ak.int64.
endpoint (bool, optional) – If true, sample from the interval [low, high] instead of the default [low, high).
Defaults to False
Returns:
Values drawn uniformly from the specified range having the desired dtype,
or a single such random int if size not provided.
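The analogous NumPy call (whose behavior Arkouda's Generator mimics) illustrates the endpoint keyword; this sketch does not exercise the Arkouda server path:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Default half-open interval: values in [0, 10).
lo_hi = rng.integers(low=0, high=10, size=1000)
# endpoint=True samples the closed interval [0, 10].
inclusive = rng.integers(low=0, high=10, size=1000, endpoint=True)

assert lo_hi.max() <= 9
assert inclusive.max() <= 10
```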
The probability density for the Logistic distribution is
\[P(x) = \frac{e^{-(x-\mu)/s}}{s\left(1+e^{-(x-\mu)/s}\right)^2}\]
where \(\mu\) is the location and \(s\) is the scale.
The Logistic distribution is used in Extreme Value problems where it can act
as a mixture of Gumbel distributions, in Epidemiology, and by the World Chess Federation (FIDE)
where it is used in the Elo ranking system, assuming the performance of each player
is a logistically distributed random variable.
Returns:
Pdarray of floats (unless size=None, in which case a single float is returned).
Draw samples from a log-normal distribution with specified mean,
standard deviation, and array shape.
Note that the mean and standard deviation are not the values for the distribution itself,
but of the underlying normal distribution it is derived from.
Parameters:
mean (float or pdarray of floats, optional) – Mean of the distribution. Default of 0.
sigma (float or pdarray of floats, optional) – Standard deviation of the distribution. Must be non-negative. Default of 1.
size (numeric_scalars, optional) – Output shape. Default is None, in which case a single value is returned.
method (str, optional) – Either ‘box’ or ‘zig’. ‘box’ uses the Box–Muller transform;
‘zig’ uses the Ziggurat method.
Notes
A variable x has a log-normal distribution if log(x) is normally distributed.
The probability density for the log-normal distribution is:
\[p(x) = \frac{1}{\sigma x \sqrt{2\pi}} e^{-\frac{(\ln(x)-\mu)^2}{2\sigma^2}}\]
where \(\mu\) is the mean and \(\sigma\) the standard deviation of the normally
distributed logarithm of the variable.
A log-normal distribution results if a random variable is the product of a
large number of independent, identically-distributed variables in the same
way that a normal distribution results if the variable is
the sum of a large number of independent, identically-distributed variables.
Returns:
Pdarray of floats (unless size=None, in which case a single float is returned).
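The relationship stated in the notes (the log of a log-normal variable is normally distributed) can be sketched with the analogous NumPy generator, whose interface Arkouda mirrors; this is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
mu, sigma = 3.0, 0.5
x = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

# log(x) recovers the parameters of the underlying normal distribution.
logx = np.log(x)
assert abs(logx.mean() - mu) < 0.02
assert abs(logx.std() - sigma) < 0.02
```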
The Poisson distribution is the limit of the binomial distribution for large N.
Parameters:
lam (float or pdarray) – Expected number of events occurring in a fixed-time interval, must be >= 0.
An array must have the same size as the size argument.
size (numeric_scalars, optional) – Output shape. Default is None, in which case a single value is returned.
Notes
The probability mass function for the Poisson distribution is
\[f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}\]
For events with an expected separation \(\lambda\), the Poisson distribution
\(f(k; \lambda)\) describes the probability of \(k\) events occurring
within the observed interval \(\lambda\).
Returns:
Pdarray of ints (unless size=None, in which case a single int is returned).
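A pure-Python sketch of the Poisson probability mass function referenced in the notes; the helper poisson_pmf is illustrative, not part of Arkouda:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """f(k; lam) = lam**k * exp(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

# The probabilities over k sum to 1 (numerically, over a generous range).
lam = 4.0
total = sum(poisson_pmf(k, lam) for k in range(100))
assert abs(total - 1.0) < 1e-9
```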
Samples are uniformly distributed over the half-open interval [low, high).
In other words, any value within the given interval is equally likely to be drawn by uniform.
Parameters:
low (float, optional) – Lower boundary of the output interval. All values generated will be greater than or
equal to low. The default value is 0.
high (float, optional) – Upper boundary of the output interval. All values generated will be less than high.
high must be greater than or equal to low. The default value is 1.0.
size (numeric_scalars, optional) – Output shape. Default is None, in which case a single value is returned.
Returns:
Pdarray of floats (unless size=None, in which case a single float is returned).
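The analogous NumPy call (whose interface Arkouda mirrors) illustrates the half-open interval [low, high); this sketch does not exercise the Arkouda server path:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
u = rng.uniform(low=-5.0, high=5.0, size=1000)

# Every draw satisfies low <= value < high.
assert u.min() >= -5.0
assert u.max() < 5.0
```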
If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Type:
bool (default=True)
Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but
float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray
and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays
that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that
groups the array
If the input is a single array with a .group() method defined, that
method will be used; otherwise, the ._get_grouping_keys() method will be used.
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
maximum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmax
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or
if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in,
not the permutation applied by the GroupBy instance.
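A small NumPy sketch (illustrative only, not the Arkouda implementation) of the semantics described in the notes: the returned indices point into the original, unpermuted values array:

```python
import numpy as np

keys = np.array([1, 0, 1, 0, 1])
vals = np.array([9, 2, 4, 7, 6])

# One argmax index per unique key, expressed in original-array positions.
unique_keys = np.unique(keys)
group_argmax = np.array(
    [np.flatnonzero(keys == k)[np.argmax(vals[keys == k])] for k in unique_keys]
)
# Group 0 holds vals[1]=2 and vals[3]=7 -> index 3;
# group 1 holds vals[0]=9, vals[2]=4, vals[4]=6 -> index 0.
assert list(group_argmax) == [3, 0]
```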
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
minimum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmin
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmin
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values
size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as
passed in, not the permutation applied by the GroupBy instance.
values (pdarray, Strings) – The values to put in each group’s segment
permute (bool) – If True (default), permute broadcast values back to the ordering
of the original array on which GroupBy was called. If False, the
broadcast values are grouped by value.
TypeError – Raised if values is not a pdarray object
ValueError – Raised if the values array does not have one
value per segment
Notes
This function is a sparse analog of np.broadcast. If a
GroupBy object represents a sparse matrix (tensor), then
this function takes a (dense) column vector and replicates
each value to the non-zero elements in the corresponding row.
Examples
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)

By default, the result is in the original order:

>>> g.broadcast(values)
array([3, 5, 3, 5, 3])

With permute=False, the result is in grouped order:

>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])

>>> a = ak.randint(1, 5, 10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys, counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
Function to build a new GroupBy object from component keys and permutation.
Parameters:
user_defined_name (str, optional) – If passed, the new GroupBy is registered under the given name
kwargs (dict) – Dictionary of components required for rebuilding the GroupBy.
Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
Returns:
The GroupBy object created by using the given components
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first n items of each group.
If return_indices is True, the result is indices;
otherwise, it is values.
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to ‘strict’.
Register this GroupBy object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under,
this will be the root name for underlying components
Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different GroupBys with the same name.
Return a random sample from each group. You can specify either the number of elements
or the fraction of elements to be sampled. random_state can be used for reproducibility.
Parameters:
values ((list of) pdarray-like) – The values from which to sample, according to their group membership.
n (int, optional) – Number of items to return for each group.
Cannot be used with frac and must be no larger than
the smallest group unless replace is True.
Default is one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the value more than once.
weights (pdarray, optional) – Default None results in equal probability weighting.
If passed a pdarray, then values must have the same length as the groupby keys
and will be used as sampling probabilities after normalization within each group.
Weights must be non-negative with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for random number generator.
If ak.random.Generator, use as given.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the sampled values.
permute_samples (bool, default False) – If True, permute the samples according to the grouping.
Otherwise, keep samples in original order.
Returns:
If return_indices is True, the indices of the sampled values.
Otherwise, the sampled values.
Using the permutation stored in the GroupBy instance, group
another array of values and compute the standard deviation of
each group’s values.
Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – If True, NaN values are skipped when computing the standard deviation
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared
deviations from the mean, i.e., std=sqrt(mean((x-x.mean())**2)).
The average squared deviation is normally calculated as
x.sum()/N, where N=len(x). If, however, ddof is specified,
the divisor N-ddof is used instead. In standard statistical
practice, ddof=1 provides an unbiased estimator of the variance
of the infinite population. ddof=0 provides a maximum likelihood
estimate of the variance for normally distributed variables. The
standard deviation computed in this function is the square root of
the estimated variance, so even with ddof=1, it will not be an
unbiased estimate of the standard deviation per se.
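The ddof behavior described above can be checked with a short NumPy sketch (illustrative, not the Arkouda code path): ddof=0 divides the sum of squared deviations by N, ddof=1 by N-1.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = x.size
ss = ((x - x.mean()) ** 2).sum()  # sum of squared deviations from the mean

# std = sqrt(ss / (N - ddof)) for each choice of ddof.
assert np.isclose(np.std(x, ddof=0), np.sqrt(ss / n))
assert np.isclose(np.std(x, ddof=1), np.sqrt(ss / (n - 1)))
```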
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The last n items of each group.
If return_indices is True, the result is indices;
otherwise, it is values.
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file
per locale of the arkouda server, where each filename starts with prefix_path.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to “single”, the dataset is written to a single file.
When “distribute”, the dataset is written to one file per locale.
This is only supported by HDF5 files and has no impact on Parquet files.
Using the permutation stored in the GroupBy instance, group
another array of values and compute the variance of
each group’s values.
Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – If True, NaN values are skipped when computing the variance
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean,
i.e., var=mean((x-x.mean())**2).
The mean is normally calculated as x.sum()/N, where N=len(x).
If, however, ddof is specified, the divisor N-ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.
If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Type:
bool (default=True)
Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but
float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray
and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays
that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that
groups the array
If the input is a single array with a .group() method defined, method 2
will be used; otherwise, method 1 will be used.
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
maximum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmax
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or
if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in,
not the permutation applied by the GroupBy instance.
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
minimum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmin
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values
size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as
passed in, not the permutation applied by the GroupBy instance.
values (pdarray, Strings) – The values to put in each group’s segment
permute (bool) – If True (default), permute broadcast values back to the ordering
of the original array on which GroupBy was called. If False, the
broadcast values are grouped by value.
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one
value per segment
Notes
This function is a sparse analog of np.broadcast. If a
GroupBy object represents a sparse matrix (tensor), then
this function takes a (dense) column vector and replicates
each value to the non-zero elements in the corresponding row.
Examples
>>> a=ak.array([0,1,0,1,0])>>> values=ak.array([3,5])>>> g=ak.GroupBy(a)# By default, result is in original order>>> g.broadcast(values)array([3, 5, 3, 5, 3])# With permute=False, result is in grouped order>>> g.broadcast(values,permute=False)array([3, 3, 3, 5, 5]>>> a=ak.randint(1,5,10)>>> aarray([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])>>> g=ak.GroupBy(a)>>> keys,counts=g.size()>>> g.broadcast(counts>2)array([True False True True True False True True False False])>>> g.broadcast(counts==3)array([True False True True True False True True False False])>>> g.broadcast(counts<4)array([True True True True True True True True True True])
function to build a new GroupBy object from component keys and permutation.
Parameters:
user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name
kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
Returns:
The GroupBy object created by using the given components
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first n items of each group.
If return_indices is True, the result are indices.
O.W. the result are values.
Register this GroupBy object and its underlying components with the Arkouda server.
Parameters:
user_defined_name (str) – User-defined name the GroupBy is to be registered under;
this will be the root name for the underlying components
Returns:
The same GroupBy, which is now registered with the arkouda server and has an updated name.
This is an in-place modification; the original is returned to support a
fluent programming style.
Please note you cannot register two different GroupBys under the same name.
Return a random sample from each group. You can specify either the number of elements
or the fraction of elements to be sampled; random_state can be used for reproducibility.
Parameters:
values ((list of) pdarray-like) – The values from which to sample, according to their group membership.
n (int, optional) – Number of items to return for each group.
Cannot be used with frac and must be no larger than
the smallest group unless replace is True.
Defaults to one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the same value more than once.
weights (pdarray, optional) – Default None results in equal probability weighting.
If a pdarray is passed, it must have the same length as the groupby keys
and will be used as sampling probabilities after normalization within each group.
Weights must be non-negative, with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for the random number generator.
If ak.random.Generator, use as given.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the sampled values.
permute_samples (bool, default False) – If True, permute the samples according to group.
Otherwise, keep samples in the original order.
Returns:
If return_indices is True, the indices of the sampled values.
Otherwise, the sampled values.
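Per-group sampling can be sketched in pure Python. `group_sample` is a hypothetical helper: `random.Random` stands in for `ak.random.Generator`, plain lists stand in for pdarrays, and only the n/replace/random_state parameters are modeled.

```python
import random
from collections import defaultdict

def group_sample(keys, values, n=1, replace=False, random_state=None):
    """Draw n items from each group, mimicking GroupBy.sample on lists."""
    rng = random.Random(random_state)       # int seed for reproducibility
    by_key = defaultdict(list)
    for k, v in zip(keys, values):
        by_key[k].append(v)
    out = []
    for k in sorted(by_key):                # one block of n draws per unique key
        pool = by_key[k]
        if replace:
            out.extend(rng.choices(pool, k=n))
        else:
            out.extend(rng.sample(pool, n))  # n must not exceed the group size
    return out

# One draw per group: the first element comes from group 0, the second from group 1.
print(group_sample([0, 1, 0, 1], ['a', 'b', 'c', 'd'], n=1, random_state=0))
```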
Using the permutation stored in the GroupBy instance, group
another array of values and compute the standard deviation of
each group’s values.
Parameters:
values (pdarray) – The values to group and find the standard deviation of
skipna (bool) – If True, NaN values are skipped when computing the result
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared
deviations from the mean, i.e., std=sqrt(mean((x-x.mean())**2)).
The average squared deviation is normally calculated as
x.sum()/N, where N=len(x). If, however, ddof is specified,
the divisor N-ddof is used instead. In standard statistical
practice, ddof=1 provides an unbiased estimator of the variance
of the infinite population. ddof=0 provides a maximum likelihood
estimate of the variance for normally distributed variables. The
standard deviation computed in this function is the square root of
the estimated variance, so even with ddof=1, it will not be an
unbiased estimate of the standard deviation per se.
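The N - ddof divisor described above can be shown with a pure-Python sketch. `group_std` is a hypothetical helper on plain lists (Arkouda computes this server-side on pdarrays), with no NaN handling.

```python
import math

def group_std(keys, values, ddof=0):
    """Per-group standard deviation with the N - ddof divisor."""
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    out = {}
    for k, xs in groups.items():
        mean = sum(xs) / len(xs)
        # Divisor N - ddof: ddof=1 gives the unbiased variance estimate,
        # ddof=0 the maximum likelihood estimate.
        var = sum((x - mean) ** 2 for x in xs) / (len(xs) - ddof)
        out[k] = math.sqrt(var)
    return out

# Group 0 holds [1, 3]: mean 2, var (1 + 1) / 2 = 1, std 1.0 (with ddof=0).
print(group_std([0, 0, 1, 1], [1, 3, 10, 14]))  # {0: 1.0, 1: 2.0}
```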
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If a group has fewer than n values, all of its values are returned.
return_indices (bool, default False) – If True, return the indices of the selected values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The last n items of each group.
If return_indices is True, the result contains indices; otherwise, it contains values.
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file
per locale of the arkouda server, where each filename starts with prefix_path.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to “single”, the dataset is written to a single file.
When “distribute”, the dataset is written to one file per locale.
This is only supported by HDF5 files and has no impact on Parquet files.
Using the permutation stored in the GroupBy instance, group
another array of values and compute the variance of
each group’s values.
Parameters:
values (pdarray) – The values to group and find the variance of
skipna (bool) – If True, NaN values are skipped when computing the result
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean,
i.e., var=mean((x-x.mean())**2).
The mean is normally calculated as x.sum()/N, where N=len(x).
If, however, ddof is specified, the divisor N-ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.
If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Type:
bool (default=True)
Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but
float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray
and define or overload the grouping API:
1. A ._get_grouping_keys() method that returns a list of pdarrays
that can be (co)argsorted.
2. (Optional) A .group() method that returns the permutation that
groups the array.
If the input is a single array with a .group() method defined, method 2
will be used; otherwise, method 1 will be used.
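The grouping API above can be sketched as a hypothetical class. `MyGroupable` is illustrative only: a real groupable class must inherit from pdarray and return pdarrays, whereas this sketch uses plain lists to show the shape of the two methods.

```python
class MyGroupable:
    """Sketch of the grouping API; a real implementation would
    inherit from ak.pdarray and return pdarrays."""

    def __init__(self, codes):
        self.codes = codes               # e.g. integer category codes

    def _get_grouping_keys(self):
        # Method 1: one or more (co)argsort-able key arrays.
        return [self.codes]

    def group(self):
        # Method 2 (optional): the permutation that sorts values into groups.
        return sorted(range(len(self.codes)), key=self.codes.__getitem__)

g = MyGroupable([2, 0, 1, 0])
print(g.group())  # [1, 3, 2, 0]: indices of the 0s first, then the 1, then the 2
```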
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
maximum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmax
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or
if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in,
not the permutation applied by the GroupBy instance.
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
minimum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmin
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmin
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values
size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as
passed in, not the permutation applied by the GroupBy instance.
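The note that returned indices refer to the original values array can be shown with a pure-Python sketch. `group_argmin` is a hypothetical helper on plain lists, not the Arkouda implementation.

```python
def group_argmin(keys, values):
    """Index of each group's minimum, in the original array's coordinates."""
    best = {}
    for i, (k, v) in enumerate(zip(keys, values)):
        # Strict < keeps the first occurrence on ties ("first minimum").
        if k not in best or v < values[best[k]]:
            best[k] = i
    return [best[k] for k in sorted(best)]  # one index per unique key

# Group 0's values sit at indices 0, 2, 4; its minimum (3) is at index 4.
print(group_argmin([0, 1, 0, 1, 0], [9, 7, 5, 2, 3]))  # [4, 3]
```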
values (pdarray, Strings) – The values to put in each group’s segment
permute (bool) – If True (default), permute the broadcast values back to the ordering
of the original array on which GroupBy was called. If False, the
broadcast values remain in grouped order.
Raises:
TypeError – Raised if values is not a pdarray object
ValueError – Raised if the values array does not have one
value per segment
Notes
This function is a sparse analog of np.broadcast. If a
GroupBy object represents a sparse matrix (tensor), then
this function takes a (dense) column vector and replicates
each value to the non-zero elements in the corresponding row.
Examples
>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
>>> # By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
>>> # With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5])
>>> a = ak.randint(1, 5, 10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys, counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
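The broadcast semantics can also be sketched in pure Python. `group_broadcast` is a hypothetical helper on plain lists, assuming one value per unique key given in grouped (sorted-key) order, as in the example above.

```python
def group_broadcast(keys, group_values, permute=True):
    """Replicate one value per unique key out to every element of its group."""
    uniq = sorted(set(keys))
    val_for = dict(zip(uniq, group_values))   # one value per unique key
    if permute:
        # Original order: each position receives its own group's value.
        return [val_for[k] for k in keys]
    # Grouped order: all of the first group's copies, then the next group's, ...
    return [val_for[k] for k in sorted(keys)]

print(group_broadcast([0, 1, 0, 1, 0], [3, 5]))                 # [3, 5, 3, 5, 3]
print(group_broadcast([0, 1, 0, 1, 0], [3, 5], permute=False))  # [3, 3, 3, 5, 5]
```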
function to build a new GroupBy object from component keys and permutation.
Parameters:
user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name
kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
Returns:
The GroupBy object created by using the given components
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first n items of each group.
If return_indices is True, the result are indices.
O.W. the result are values.
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to ‘strict’.
Register this GroupBy object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under,
this will be the root name for underlying components
Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different GroupBys with the same name.
Return a random sample from each group. You can either specify the number of elements
or the fraction of elements to be sampled. random_state can be used for reproducibility
Parameters:
values ((list of) pdarray-like) – The values from which to sample, according to their group membership.
n (int, optional) – Number of items to return for each group.
Cannot be used with frac and must be no larger than
the smallest group unless replace is True.
Default is one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the value more than once.
weights (pdarray, optional) – Default None results in equal probability weighting.
If passed a pdarray, then values must have the same length as the groupby keys
and will be used as sampling probabilities after normalization within each group.
Weights must be non-negative with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for random number generator.
If ak.random.Generator, use as given.
return_indices (bool, default False) – if True, return the indices of the sampled values.
Otherwise, return the sample values.
permute_samples (bool, default False) – if True, return permute the samples according to group
Otherwise, keep samples in original order.
Returns:
if return_indices is True, return the indices of the sampled values.
Otherwise, return the sample values.
Using the permutation stored in the GroupBy instance, group
another array of values and compute the standard deviation of
each group’s values.
Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared
deviations from the mean, i.e., std=sqrt(mean((x-x.mean())**2)).
The average squared deviation is normally calculated as
x.sum()/N, where N=len(x). If, however, ddof is specified,
the divisor N-ddof is used instead. In standard statistical
practice, ddof=1 provides an unbiased estimator of the variance
of the infinite population. ddof=0 provides a maximum likelihood
estimate of the variance for normally distributed variables. The
standard deviation computed in this function is the square root of
the estimated variance, so even with ddof=1, it will not be an
unbiased estimate of the standard deviation per se.
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The last n items of each group.
If return_indices is True, the result are indices.
O.W. the result are values.
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file
per locale of the arkouda server, where each filename starts with prefix_path.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact of Parquet Files.
Using the permutation stored in the GroupBy instance, group
another array of values and compute the variance of
each group’s values.
Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean,
i.e., var=mean((x-x.mean())**2).
The mean is normally calculated as x.sum()/N, where N=len(x).
If, however, ddof is specified, the divisor N-ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.
If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Type:
bool (default=True)
Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but
float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray
and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays
that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that
groups the array
If the input is a single array with a .group() method defined, method 2
will be used; otherwise, method 1 will be used.
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
maximum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmax
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or
if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in,
not the permutation applied by the GroupBy instance.
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
minimum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmin
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values
size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as
passed in, not the permutation applied by the GroupBy instance.
values (pdarray, Strings) – The values to put in each group’s segment
permute (bool) – If True (default), permute broadcast values back to the ordering
of the original array on which GroupBy was called. If False, the
broadcast values are grouped by value.
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one
value per segment
Notes
This function is a sparse analog of np.broadcast. If a
GroupBy object represents a sparse matrix (tensor), then
this function takes a (dense) column vector and replicates
each value to the non-zero elements in the corresponding row.
Examples
>>> a=ak.array([0,1,0,1,0])>>> values=ak.array([3,5])>>> g=ak.GroupBy(a)# By default, result is in original order>>> g.broadcast(values)array([3, 5, 3, 5, 3])# With permute=False, result is in grouped order>>> g.broadcast(values,permute=False)array([3, 3, 3, 5, 5]>>> a=ak.randint(1,5,10)>>> aarray([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])>>> g=ak.GroupBy(a)>>> keys,counts=g.size()>>> g.broadcast(counts>2)array([True False True True True False True True False False])>>> g.broadcast(counts==3)array([True False True True True False True True False False])>>> g.broadcast(counts<4)array([True True True True True True True True True True])
function to build a new GroupBy object from component keys and permutation.
Parameters:
user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name
kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
Returns:
The GroupBy object created by using the given components
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first n items of each group.
If return_indices is True, the result are indices.
O.W. the result are values.
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to ‘strict’.
Register this GroupBy object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under,
this will be the root name for underlying components
Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different GroupBys with the same name.
Return a random sample from each group. You can either specify the number of elements
or the fraction of elements to be sampled. random_state can be used for reproducibility
Parameters:
values ((list of) pdarray-like) – The values from which to sample, according to their group membership.
n (int, optional) – Number of items to return for each group.
Cannot be used with frac and must be no larger than
the smallest group unless replace is True.
Default is one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the value more than once.
weights (pdarray, optional) – Default None results in equal probability weighting.
If passed a pdarray, then values must have the same length as the groupby keys
and will be used as sampling probabilities after normalization within each group.
Weights must be non-negative with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for random number generator.
If ak.random.Generator, use as given.
return_indices (bool, default False) – if True, return the indices of the sampled values.
Otherwise, return the sample values.
permute_samples (bool, default False) – if True, return permute the samples according to group
Otherwise, keep samples in original order.
Returns:
if return_indices is True, return the indices of the sampled values.
Otherwise, return the sample values.
Using the permutation stored in the GroupBy instance, group
another array of values and compute the standard deviation of
each group’s values.
Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared
deviations from the mean, i.e., std=sqrt(mean((x-x.mean())**2)).
The average squared deviation is normally calculated as
x.sum()/N, where N=len(x). If, however, ddof is specified,
the divisor N-ddof is used instead. In standard statistical
practice, ddof=1 provides an unbiased estimator of the variance
of the infinite population. ddof=0 provides a maximum likelihood
estimate of the variance for normally distributed variables. The
standard deviation computed in this function is the square root of
the estimated variance, so even with ddof=1, it will not be an
unbiased estimate of the standard deviation per se.
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The last n items of each group.
If return_indices is True, the result are indices.
O.W. the result are values.
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file
per locale of the arkouda server, where each filename starts with prefix_path.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact of Parquet Files.
Using the permutation stored in the GroupBy instance, group
another array of values and compute the variance of
each group’s values.
Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean,
i.e., var=mean((x-x.mean())**2).
The mean is normally calculated as x.sum()/N, where N=len(x).
If, however, ddof is specified, the divisor N-ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.
If True, and the groupby keys contain NaN values,
the NaN values together with the corresponding row will be dropped.
Otherwise, the rows corresponding to NaN values will be kept.
Type:
bool (default=True)
Raises:
TypeError – Raised if keys is a pdarray with a dtype other than int64
Notes
Integral pdarrays, Strings, and Categoricals are natively supported, but
float64 and bool arrays are not.
For a user-defined class to be groupable, it must inherit from pdarray
and define or overload the grouping API:
a ._get_grouping_keys() method that returns a list of pdarrays
that can be (co)argsorted.
(Optional) a .group() method that returns the permutation that
groups the array
If the input is a single array with a .group() method defined, method 2
will be used; otherwise, method 1 will be used.
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
maximum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmax
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values size or
if the operator is not in the GroupBy.Reductions array
Notes
The returned indices refer to the original values array as passed in,
not the permutation applied by the GroupBy instance.
Using the permutation stored in the GroupBy instance, group
another array of values and return the location of the first
minimum of each group’s values.
Parameters:
values (pdarray) – The values to group and find argmin
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object or if argmax
is not supported for the values dtype
ValueError – Raised if the key array size does not match the values
size or if the operator is not in the GroupBy.Reductions array
RuntimeError – Raised if argmin is not supported for the values dtype
Notes
The returned indices refer to the original values array as
passed in, not the permutation applied by the GroupBy instance.
values (pdarray, Strings) – The values to put in each group’s segment
permute (bool) – If True (default), permute broadcast values back to the ordering
of the original array on which GroupBy was called. If False, the
broadcast values are grouped by value.
TypeError – Raised if value is not a pdarray object
ValueError – Raised if the values array does not have one
value per segment
Notes
This function is a sparse analog of np.broadcast. If a
GroupBy object represents a sparse matrix (tensor), then
this function takes a (dense) column vector and replicates
each value to the non-zero elements in the corresponding row.
Examples
>>> a=ak.array([0,1,0,1,0])>>> values=ak.array([3,5])>>> g=ak.GroupBy(a)# By default, result is in original order>>> g.broadcast(values)array([3, 5, 3, 5, 3])# With permute=False, result is in grouped order>>> g.broadcast(values,permute=False)array([3, 3, 3, 5, 5]>>> a=ak.randint(1,5,10)>>> aarray([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])>>> g=ak.GroupBy(a)>>> keys,counts=g.size()>>> g.broadcast(counts>2)array([True False True True True False True True False False])>>> g.broadcast(counts==3)array([True False True True True False True True False False])>>> g.broadcast(counts<4)array([True True True True True True True True True True])
Function to build a new GroupBy object from component keys and permutation.
Parameters:
user_defined_name (str, optional) – Passing a name will initialize the new GroupBy and assign it the given name
kwargs (dict) – Dictionary of components required for rebuilding the GroupBy.
Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”
Returns:
The GroupBy object created by using the given components
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The first n items of each group.
If return_indices is True, the result contains indices;
otherwise, it contains values.
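The first-n-per-group selection above can be sketched in plain Python (an illustrative analog with a hypothetical `group_head` name, not Arkouda's implementation):

```python
# Hypothetical sketch of selecting up to n items per group, in grouped
# order, while preserving each group's original ordering of values.
def group_head(keys, values, n=5):
    # stable sort by key keeps each group's original order intact
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    taken, out = {}, []
    for i in order:
        k = keys[i]
        if taken.get(k, 0) < n:
            taken[k] = taken.get(k, 0) + 1
            out.append(values[i])
    return sorted(set(keys)), out

uniq, first2 = group_head([0, 1, 0, 1, 0], [10, 20, 11, 21, 12], n=2)
# uniq == [0, 1]; first2 == [10, 11, 20, 21]
```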
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to ‘strict’.
Register this GroupBy object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the GroupBy is to be registered under,
this will be the root name for underlying components
Returns:
The same GroupBy which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different GroupBys with the same name.
Return a random sample from each group. You can either specify the number of elements
or the fraction of elements to be sampled. random_state can be used for reproducibility
Parameters:
values ((list of) pdarray-like) – The values from which to sample, according to their group membership.
n (int, optional) – Number of items to return for each group.
Cannot be used with frac and must be no larger than
the smallest group unless replace is True.
Default is one if frac is None.
frac (float, optional) – Fraction of items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the value more than once.
weights (pdarray, optional) – Default None results in equal probability weighting.
If passed a pdarray, then values must have the same length as the groupby keys
and will be used as sampling probabilities after normalization within each group.
Weights must be non-negative with at least one positive element within each group.
random_state (int or ak.random.Generator, optional) – If int, seed for random number generator.
If ak.random.Generator, use as given.
return_indices (bool, default False) – if True, return the indices of the sampled values.
Otherwise, return the sample values.
permute_samples (bool, default False) – If True, permute the samples according to group.
Otherwise, keep samples in original order.
Returns:
if return_indices is True, return the indices of the sampled values.
Otherwise, return the sample values.
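Per-group sampling without replacement can be sketched with the standard-library random module (a simplified analog with a hypothetical `group_sample` name; it omits frac, weights, and the permutation options described above):

```python
import random

# Hypothetical sketch: draw n values without replacement from each
# group's segment, seeded for reproducibility (cf. random_state).
def group_sample(keys, values, n=1, seed=0):
    rng = random.Random(seed)
    groups = {}
    for k, v in zip(keys, values):
        groups.setdefault(k, []).append(v)
    return {k: rng.sample(vs, n) for k, vs in sorted(groups.items())}

out = group_sample([0, 0, 1, 1], [10, 11, 20, 21], n=1, seed=42)
# each group contributes exactly n of its own values
```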
Using the permutation stored in the GroupBy instance, group
another array of values and compute the standard deviation of
each group’s values.
Parameters:
values (pdarray) – The values to group and find standard deviation
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The standard deviation is the square root of the average of the squared
deviations from the mean, i.e., std=sqrt(mean((x-x.mean())**2)).
The average squared deviation is normally calculated as
x.sum()/N, where N=len(x). If, however, ddof is specified,
the divisor N-ddof is used instead. In standard statistical
practice, ddof=1 provides an unbiased estimator of the variance
of the infinite population. ddof=0 provides a maximum likelihood
estimate of the variance for normally distributed variables. The
standard deviation computed in this function is the square root of
the estimated variance, so even with ddof=1, it will not be an
unbiased estimate of the standard deviation per se.
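The ddof behavior described above can be checked with the standard-library statistics module, whose pstdev and stdev correspond to ddof=0 and ddof=1 respectively:

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean = 5; squared deviations sum to 32
pop_std = statistics.pstdev(x)   # ddof=0: sqrt(32 / 8) == 2.0
samp_std = statistics.stdev(x)   # ddof=1: sqrt(32 / 7), ~2.138
```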
values ((list of) pdarray-like) – The values from which to select, according to their group membership.
n (int, optional, default = 5) – Maximum number of items to return for each group.
If the number of values in a group is less than n,
all the values from that group will be returned.
return_indices (bool, default False) – If True, return the indices of the sampled values.
Otherwise, return the selected values.
Returns:
unique_keys ((list of) pdarray-like) – The unique keys, in grouped order
result (pdarray-like) – The last n items of each group.
If return_indices is True, the result contains indices;
otherwise, it contains values.
Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file
per locale of the arkouda server, where each filename starts with prefix_path.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
Using the permutation stored in the GroupBy instance, group
another array of values and compute the variance of
each group’s values.
Parameters:
values (pdarray) – The values to group and find variance
skipna (bool) – boolean which determines if NANs should be skipped
ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var
Returns:
unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order
group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance
Raises:
TypeError – Raised if the values array is not a pdarray object
ValueError – Raised if the key array size does not match the values size
or if the operator is not in the GroupBy.Reductions array
Notes
The return dtype is always float64.
The variance is the average of the squared deviations from the mean,
i.e., var=mean((x-x.mean())**2).
The mean is normally calculated as x.sum()/N, where N=len(x).
If, however, ddof is specified, the divisor N-ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.
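The divisor difference (N versus N-ddof) is easy to verify with the standard-library statistics module, where pvariance uses ddof=0 and variance uses ddof=1:

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0]  # mean = 2.5; squared deviations sum to 5.0
mle_var = statistics.pvariance(x)  # divisor N:     5.0 / 4 == 1.25
unb_var = statistics.variance(x)   # divisor N - 1: 5.0 / 3
```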
This class is a thin wrapper around pdarray that mostly affects
how values are displayed to the user. Operators and methods will
typically treat this class like an int64 pdarray.
Register this IPv4 object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the IPv4 is to be registered under,
this will be the root name for underlying components
Returns:
The same IPv4 which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different IPv4s with the same name.
Parameters:
name (str, default=None) – Name to be stored in the index.
allow_list (bool, default=False) – If False, list values will be converted to a pdarray.
If True, list values will remain as a list, provided the data length is less than max_list_size.
max_list_size (int, default=1000) – The maximum allowed data length for the values to be stored as a list object.
Raises:
ValueError – Raised if allow_list=True and the size of values is > max_list_size.
Register this Index object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the Index is to be registered under,
this will be the root name for underlying components
Returns:
The same Index which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different Indexes with the same name.
DEPRECATED
Save the index to HDF5 or Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If
‘Parquet’, the files will be written to the Parquet file format. This
is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to
file write location or if the mode parameter is neither truncate
nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters
is not a string.
Raised if the Index values are a list.
The prefix_path must be visible to the arkouda server and the user must
have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales. If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
Previously, all files saved in Parquet format were saved with a .parquet file extension.
Loading these files requires calling load with the extension included in the name. Try this if
an older file is not being found.
Any file extension can be used. The file I/O does not rely on the extension to determine the
file format.
Write Index to CSV file(s). File will contain a single column with the pdarray data.
All CSV Files written by Arkouda include a header denoting data types of the columns.
prefix_path: str
The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
dataset: str
Column name to save the pdarray under. Defaults to “array”.
col_delim: str
Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite: bool
Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
str response message
ValueError
Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist.
RuntimeError
Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError
Raised if we receive an unknown arkouda_type returned from the server.
Raised if the Index values are a list.
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
Save the Index to HDF5.
The object can be saved to a collection of files or single file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
TypeError – Raised if the Index values are a list.
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Save the Index to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
TypeError – Raised if the Index values are a list.
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Overwrite the dataset with the name provided with this Index object. If
the dataset does not exist it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the index
Notes
If the file does not contain a File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, this will create a copy of the
file with the new data
Parameters:
name (str, default=None) – Name to be stored in the index.
allow_list (bool, default=False) – If False, list values will be converted to a pdarray.
If True, list values will remain as a list, provided the data length is less than max_list_size.
max_list_size (int, default=1000) – The maximum allowed data length for the values to be stored as a list object.
Raises:
ValueError – Raised if allow_list=True and the size of values is > max_list_size.
Register this Index object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the Index is to be registered under,
this will be the root name for underlying components
Returns:
The same Index which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different Indexes with the same name.
Save the Index to HDF5.
The object can be saved to a collection of files or single file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray.
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Overwrite the dataset with the name provided with this Index object. If
the dataset does not exist it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the index
TypeError – Raised if the Index levels are a list.
Notes
If the file does not contain a File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, this will create a copy of the
file with the new data
D.update([E, ]**F) -> None. Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
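A quick illustration of the three update paths described above (a mapping E, an iterable of pairs E, and keyword pairs F):

```python
d = {"a": 1}
d.update({"b": 2}, c=3)    # E has .keys(), plus keyword pairs F
d.update([("a", 10)])      # E without .keys(): iterable of (key, value) pairs
# d == {"a": 10, "b": 2, "c": 3}
```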
Append other to self, either vertically (axis=0, length of resulting SegArray
increases), or horizontally (axis=1, each sub-array of other appends to the
corresponding sub-array of self).
Select the j-th element of each sub-array, where possible.
Parameters:
j (int) – The index of the value to get from each sub-array. If j is negative,
it counts backwards from the end of each sub-array.
return_origins (bool) – If True, return a logical index indicating where j is in bounds
compressed (bool) – If False, return array is same size as self, with default value
where j is out of bounds. If True, the return array only contains
values where j is in bounds.
default (scalar) – When compressed=False, the value to return when j is out of bounds
for the sub-array
Returns:
val (pdarray) – compressed=False: The j-th value of each sub-array where j is in
bounds and the default value where j is out of bounds.
compressed=True: The j-th values of only the sub-arrays where j is
in bounds
origin_indices (pdarray, bool) – A Boolean array that is True where j is in bounds for the sub-array.
Notes
If values are Strings, only the compressed format is supported.
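The compressed=False behavior can be sketched in plain Python (an illustrative analog with a hypothetical `get_jth` name; negative j counts from the end, as above):

```python
# Hypothetical sketch: take the j-th element of each sub-array,
# substituting a default where j is out of bounds, and return the
# in-bounds mask alongside the values.
def get_jth(subarrays, j, default=0):
    in_bounds = [-len(s) <= j < len(s) for s in subarrays]
    vals = [s[j] if ok else default for s, ok in zip(subarrays, in_bounds)]
    return vals, in_bounds

vals, ok = get_jth([[1, 2], [3], []], 1)
# vals == [2, 0, 0]; ok == [True, False, False]
```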
Return all sub-arrays of length n, as a list of columns.
Parameters:
n (int) – Length of sub-arrays to select
return_origins (bool) – Return a logical index indicating which sub-arrays are length n
Returns:
columns (list of pdarray) – An n-long list of pdarray, where each row is one of the n-long
sub-arrays from the SegArray. The number of rows is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Array of bool for each element of the SegArray, True where sub-array
has length n.
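The column-wise return shape can be sketched in plain Python (hypothetical `length_n_columns` name; Arkouda returns pdarrays where this sketch uses lists):

```python
# Hypothetical sketch: keep only sub-arrays of exactly length n and
# transpose them into n columns, plus the origin mask.
def length_n_columns(subarrays, n):
    mask = [len(s) == n for s in subarrays]
    rows = [s for s, keep in zip(subarrays, mask) if keep]
    columns = [[r[i] for r in rows] for i in range(n)]
    return columns, mask

cols, mask = length_n_columns([[1, 2], [3], [4, 5]], 2)
# cols == [[1, 4], [2, 5]]; mask == [True, False, True]
```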
Return all sub-array prefixes of length n (for sub-arrays that are at least n+1 long)
Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which sub-arrays
were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from sub-arrays
that are at least n+1 long. If False, allow the entire
sub-array to be returned as a prefix.
Returns:
prefixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-prefix.
The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return
an n-prefix, False otherwise.
Return the n-long suffix of each sub-array, where possible
Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which sub-arrays
were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from sub-arrays
that are at least n+1 long. If False, allow the entire
sub-array to be returned as a suffix.
Returns:
suffixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-suffix.
The number of rows is the number of True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return
an n-suffix, False otherwise.
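The suffix extraction (and the effect of proper) can be sketched in plain Python (hypothetical `n_suffixes` name, lists standing in for pdarrays):

```python
# Hypothetical sketch: for each sub-array long enough, take its last n
# elements; proper=True requires length > n, proper=False allows == n.
def n_suffixes(subarrays, n, proper=True):
    min_len = n + 1 if proper else n
    mask = [len(s) >= min_len for s in subarrays]
    rows = [s[-n:] for s, keep in zip(subarrays, mask) if keep]
    return [[r[i] for r in rows] for i in range(n)], mask

sufs, mask = n_suffixes([[1, 2, 3], [4, 5], [6]], 2)
# sufs == [[2], [3]]; mask == [True, False, False]
```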
Register this SegArray object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name which this SegArray object will be registered under
Returns:
The same SegArray which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different SegArrays with the same name.
Condense sequences of repeated values within a sub-array to a single value.
Parameters:
return_multiplicity (bool) – If True, also return the number of times each value was repeated.
Returns:
norepeats (SegArray) – Sub-arrays with runs of repeated values replaced with single value
multiplicity (SegArray) – If return_multiplicity=True, this array contains the number of times
each value in the returned SegArray was repeated in the original SegArray.
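The run-condensing behavior maps naturally onto itertools.groupby; a sketch (hypothetical `condense_repeats` name, not Arkouda's implementation):

```python
from itertools import groupby

# Hypothetical sketch: collapse each run of repeated values within a
# sub-array to a single value, recording each run's multiplicity.
def condense_repeats(subarrays):
    norepeats, multiplicity = [], []
    for sub in subarrays:
        runs = [(v, len(list(g))) for v, g in groupby(sub)]
        norepeats.append([v for v, _ in runs])
        multiplicity.append([c for _, c in runs])
    return norepeats, multiplicity

vals, counts = condense_repeats([[1, 1, 2, 2, 2, 3], [4, 4]])
# vals == [[1, 2, 3], [4]]; counts == [[2, 3, 1], [2]]
```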
DEPRECATED
Save the SegArray to HDF5.
The object can be saved to a collection of files or single file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Save the SegArray to HDF5. The result is a collection of HDF5 files, one file
per locale of the arkouda server, where each filename starts with prefix_path.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files will share
dataset (str) – Name prefix for saved data within the HDF5 file
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, add data as a new column to existing files.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact of Parquet Files.
Save the SegArray object to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the object to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: Deprecated.
Parameter kept to maintain functionality of other calls. Only Truncate
supported.
By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – If write mode is not Truncate.
Notes
Append mode for Parquet has been deprecated. It was not implemented for SegArray.
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Sends a Segmented Array to a different Arkouda server
Parameters:
hostname (str) – The hostname where the Arkouda server intended to
receive the Segmented Array is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Overwrite the dataset with the name provided with this SegArray object. If
the dataset does not exist it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
None
Raises:
RuntimeError – Raised if a server-side error is thrown saving the SegArray
Notes
If the file does not contain a File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Because HDF5 deletes do not release memory, this will create a copy of the
file with the new data
index (pdarray, Strings, optional) – an array of indices associated with the data array.
If empty, it will default to a range of ints whose size matches the size of the data.
data (Tuple, List, groupable_element_type, Series, SegArray) – a 1D array. Must not be None.
Raises:
TypeError – Raised if index is not a pdarray or Strings object
Raised if data is not a pdarray, Strings, or Categorical object
ValueError – Raised if the index size does not match data size
Notes
The Series class accepts either positional arguments or keyword arguments.
If entering positional arguments,
2 arguments entered:
argument 1 - data
argument 2 - index
1 argument entered:
argument 1 - data
If entering 1 positional argument, it is assumed that this is the data argument.
If only ‘data’ argument is passed in, Index will automatically be generated.
If entering keywords,
‘data’ (see Parameters)
‘index’ (optional) must match size of ‘data’
Concatenate in arkouda a list of arkouda Series or grouped arkouda arrays horizontally or
vertically. If a list of grouped arkouda arrays is passed they are converted to a series. Each
grouping is a 2-tuple with the first item being the key(s) and the second being the value.
If horizontal, each series or grouping must have the same length and the same index. The index
of the series is converted to a column in the dataframe. If it is a multi-index, each level is
converted to a column.
arrays: The list of series/groupings to concat.
axis : Whether to do a vertical (axis=0) or horizontal (axis=1) concatenation
index_labels: column names(s) to label the index.
value_labels: column names to label values of each series.
ordered: If True (default), the arrays will be appended in the order given. If False, array
data may be interleaved in blocks, which can greatly improve performance but
results in non-deterministic ordering of elements.
axis=0: an arkouda series.
axis=1: an arkouda dataframe.
value (scalar, Series, or pdarray) – Value to use to fill holes (e.g. 0), alternately a
Series of values specifying which value to use for
each index. Values not in the Series will not be filled.
This value cannot be a list.
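A plain-Python sketch of the scalar-fill case (the helper name `fillna` here is illustrative; the real operation runs on the arkouda server): float NaN holes are replaced, everything else passes through unchanged.

```python
import math

def fillna(values, fill):
    # Replace float NaN holes with the scalar fill value;
    # all other values pass through unchanged.
    return [fill if isinstance(v, float) and math.isnan(v) else v
            for v in values]
```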
Return a boolean same-sized object indicating if the values are NA. NA values,
such as numpy.NaN, get mapped to True values.
Everything else gets mapped to False values.
Characters such as empty strings ‘’ are not considered NA values.
Returns:
Mask of bool values for each element in Series
that indicates whether an element is an NA value.
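The NA rule above (NaN is NA, an empty string is not) can be sketched in plain Python; `isna_mask` is an illustrative name, not arkouda API:

```python
import math

def isna_mask(values):
    # NA means float NaN; empty strings are NOT considered NA,
    # matching the note above.
    return [isinstance(v, float) and math.isnan(v) for v in values]
```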
Return a boolean same-sized object indicating if the values are NA. NA values,
such as numpy.NaN, get mapped to True values.
Everything else gets mapped to False values.
Characters such as empty strings ‘’ are not considered NA values.
Returns:
Mask of bool values for each element in Series
that indicates whether an element is an NA value.
The input can be a scalar, a list of scalars, or a list of lists (if the series has a
MultiIndex). As a special case, if a Series is used as the key, the series labels are
preserved with its values used as the key.
Keys will be turned into arkouda arrays as needed.
A Series containing the values corresponding to the key.
Map values of Series according to an input mapping.
Parameters:
arg (dict or Series) – The mapping correspondence.
Returns:
A new series with the same index as the caller.
When the input Series has Categorical values,
the return Series will have Strings values.
Otherwise, the return type will match the input type.
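A plain-Python sketch of the dict-mapping case (the name `series_map` is illustrative): values absent from the mapping correspondence become NaN, as in pandas-style map semantics.

```python
def series_map(values, mapping):
    # dict correspondence: values missing from the mapping become NaN.
    return [mapping.get(v, float('nan')) for v in values]
```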
Return a boolean same-sized object indicating if the values are not NA.
Non-missing values get mapped to True.
Characters such as empty strings ‘’ are not considered NA values.
NA values, such as numpy.NaN, get mapped to False values.
Returns:
Mask of bool values for each element in Series
that indicates whether an element is not an NA value.
Return a boolean same-sized object indicating if the values are not NA.
Non-missing values get mapped to True.
Characters such as empty strings ‘’ are not considered NA values.
NA values, such as numpy.NaN, get mapped to False values.
Returns:
Mask of bool values for each element in Series
that indicates whether an element is not an NA value.
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to ‘strict’.
Concatenate a list of arkouda Series or grouped arkouda arrays, returning a PANDAS object.
If a list of grouped arkouda arrays is passed, they are converted to a series. Each grouping
is a 2-tuple with the first item being the key(s) and the second being the value.
If horizontal, each series or grouping must have the same length and the same index. The index of
the series is converted to a column in the dataframe. If it is a multi-index, each level is
converted to a column.
arrays: The list of series/groupings to concat.
axis : Whether to do a vertical (axis=0) or horizontal (axis=1) concatenation
labels: names to give the columns of the data frame.
axis=0: a local PANDAS series
axis=1: a local PANDAS dataframe
Register this Series object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the Series is to be registered under,
this will be the root name for underlying components
Returns:
The same Series which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different Series with the same name.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection,
e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec,
e.g., starting “s3://”, “gcs://”.
An error will be raised if providing this argument with a non-fsspec URL.
See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
**kwargs – These parameters will be passed to tabulate.
D.update([E, ]**F) -> None. Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
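The two update paths described above (E with a .keys() method versus E as an iterable of pairs), each followed by the keyword arguments in F, behave like this in standard Python:

```python
d = {'a': 1}
d.update({'b': 2}, c=3)    # E has .keys(): for k in E: d[k] = E[k], then F
d.update([('d', 4)], e=5)  # E lacks .keys(): for k, v in E: d[k] = v, then F
```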
Represents an array of strings whose data resides on the
arkouda server. The user should not call this class directly;
rather its instances are created by other arkouda functions.
Strings is composed of two pdarrays: (1) offsets, which contains the
starting indices for each string and (2) bytes, which contains the
raw bytes of all strings, delimited by nulls.
>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3',
... 'StrINgS aRe Here 4'])
>>> strings.title()
array(['Strings Are Here 0', 'Strings Are Here 1', 'Strings Are Here 2', 'Strings Are Here 3',
... 'Strings Are Here 4'])
Check whether each element contains the given substring.
Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that contain substr, False otherwise
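A plain-Python model of the elementwise search described above (arkouda evaluates this server-side with re2; the function name `contains` here is just illustrative):

```python
import re

def contains(strings, substr, regex=False):
    # Elementwise substring search, or regex search when regex=True.
    if regex:
        pat = re.compile(substr)
        return [pat.search(s) is not None for s in strings]
    return [substr in s for s in strings]
```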
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that end with substr, False otherwise
Unpack delimiter-joined substrings into a flat array.
Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring
in return array.
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring
in the return array
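The flattening plus the optional segments mapping can be sketched in plain Python (illustrative helper, not the arkouda implementation): each original string's substrings are appended to a flat list, and the segments array records where each original string's first substring lands.

```python
def flatten(strings, delimiter, return_segments=False):
    flat, segments = [], []
    for s in strings:
        segments.append(len(flat))  # index of first substring for s
        flat.extend(s.split(delimiter))
    return (flat, segments) if return_segments else flat
```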
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings
object on the client side and transfer the offsets & bytes separately
to the server. This results in two entries in the symbol table and we
need to instruct the server to assemble them into a composite entity.
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings.
In the future we could probably use this position to store the total bytes.
Return the n-long prefix of each string, where possible
Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings
were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a prefix.
Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character prefix, False otherwise.
Return the n-long suffix of each string, where possible
Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings
were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a suffix.
Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character suffix, False otherwise.
Return the permutation that groups the array, placing equivalent
strings together. All instances of the same string are guaranteed to lie
in one contiguous block of the permuted array, but the blocks are not
necessarily ordered.
If the arkouda server is compiled with “-sSegmentedString.useHash=true”,
then arkouda uses 128-bit hash values to group strings, rather than sorting
the strings directly. This method is fast, but the resulting permutation
merely groups equivalent strings and does not sort them. If the “useHash”
parameter is false, then a full sort is performed.
Raises:
RuntimeError – Raised if there is a server-side error in executing group request or
creating the pdarray encapsulating the return message
The implementation uses SipHash128, a fast and balanced hash function (used
by Python for dictionaries and sets). For realistic numbers of strings (up
to about 10**15), the probability of a collision between two 128-bit hash
values is negligible.
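The negligible-collision claim follows from the birthday bound: for n items and b-bit hashes, the probability of any collision is roughly n^2 / 2^(b+1). A quick check at the stated scale:

```python
# Birthday-bound estimate for n strings under 128-bit hashing:
# P(any collision) is approximately n**2 / 2**129.
n = 10**15
p_collision = n * n / 2.0**129
# At n = 10**15 this is on the order of 1e-9, i.e. negligible.
```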
Returns a boolean pdarray where index i indicates whether string i of the
Strings is alphabetic. This means there is at least one character,
and all the characters are alphabetic.
Returns:
True for elements that are alphabetic, False otherwise
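Python's built-in str.isalpha uses the same rule (non-empty and all characters alphabetic), so the elementwise semantics can be modeled directly:

```python
def isalpha_mask(strings):
    # str.isalpha: at least one character, and all characters alphabetic.
    return [s.isalpha() for s in strings]
```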
Join the strings from another array onto the left of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
Peel off one or more delimited fields from each string (similar
to string.partition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over
the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return
array. By default, it is prepended to the beginning of the
second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the first array. By default,
such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The field(s) peeled from the end of each string (unless
fromRight is true)
right: Strings
The remainder of each string after peeling (unless fromRight
is true)
TypeError – Raised if the delimiter parameter is not byte or str_scalars, if
times is not int64, or if includeDelimiter, keepPartial, or
fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
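A simplified plain-Python sketch of the core peel behavior (illustrative; it drops the delimiter and ignores the includeDelimiter, fromRight, and regex options): each string splits at its times-th delimiter, and strings with too few delimiters go whole into the right array by default, or into the left array when keepPartial is true.

```python
def peel(strings, delimiter, times=1, keep_partial=False):
    left, right = [], []
    for s in strings:
        parts = s.split(delimiter, times)
        if len(parts) > times:
            # Enough delimiters: the first `times` fields peel off left.
            left.append(delimiter.join(parts[:times]))
            right.append(parts[-1])
        elif keep_partial:
            left.append(s)
            right.append('')
        else:
            left.append('')
            right.append(s)
    return left, right
```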
Register this Strings object with a user defined name in the arkouda server
so it can be attached to later using Strings.attach()
This is an in-place operation, registering a Strings object more than once will
update the name in the registry and remove the previously registered name.
A name can only be registered to one object at a time.
Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
Returns:
The same Strings object which is now registered with the arkouda server and
has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different objects with the same name.
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name
If the user is attempting to register more than one object with the same name,
the former should be unregistered first to free up the registration name.
Peel off one or more delimited fields from the end of each string
(similar to string.rpartition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over
the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return
array. By default, it is appended to the end of the
second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the second array. By default,
such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The remainder of the string after peeling
right: Strings
The field(s) that were peeled from the right of each string
DEPRECATED
Save the Strings object to HDF5 or Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. HDF5 supports single files, in which case the file name will
be exactly the one provided. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: The name of the Strings dataset to be written, defaults to strings_array
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
Parameters:
save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5
If False the offsets array will not be saved and will be derived from the string values
upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If
‘Parquet’, the files will be written to the Parquet file format. This
is case insensitive.
file_type (str ("single" | "distribute")) – Default: Distribute
Distribute the dataset over a file per locale.
Single file will save the dataset to one file
Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets
within an hdf5 group: one for the string characters and one for the
segments corresponding to the start of each string, (2) the hdf5 group is named
via the dataset parameter. (3) Parquet files do not store the segments,
only the values.
Returns a match object with the first location in each element where pattern produces a match.
Elements match if any part of the string matches the regular expression pattern
Parameters:
pattern (str) – Regex used to find matches
Returns:
Match object where elements match if any part of the string matches the
regular expression pattern
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that start with substr, False otherwise
Join the strings from another array onto one end of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default,
other is joined to the right of self.
Returns a new Strings object with all leading and trailing occurrences of characters contained
in chars removed. The chars argument is a string specifying the set of characters to be removed.
If omitted, the chars argument defaults to removing whitespace. The chars argument is not a
prefix or suffix; rather, all combinations of its values are stripped.
Parameters:
chars – the set of characters to be removed
Returns:
Strings object with the leading and trailing characters matching the set of characters in
the chars argument removed
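Python's str.strip has exactly these semantics (chars is a set of characters, not a prefix or suffix; None strips whitespace), so the elementwise behavior can be modeled directly (illustrative helper name):

```python
def strip_all(strings, chars=None):
    # chars is a character set stripped from both ends;
    # None (the default) strips whitespace.
    return [s.strip(chars) for s in strings]
```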
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the
replacement repl.
If count is nonzero, at most count substitutions occur
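A plain-Python model of the substitution semantics (arkouda runs this server-side with re2; the helper name `sub` is illustrative): count=0 replaces all non-overlapping occurrences, a nonzero count caps the number of substitutions.

```python
import re

def sub(strings, pattern, repl, count=0):
    # Elementwise regex substitution; count=0 means replace all
    # non-overlapping matches.
    pat = re.compile(pattern)
    return [pat.sub(repl, s, count=count) for s in strings]
```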
Write Strings to CSV file(s). File will contain a single column with the Strings data.
All CSV Files written by Arkouda include a header denoting data types of the columns.
Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing
bytes as uint(8).
Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Return type:
str response message
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Save the Strings object to HDF5.
The object can be saved to a collection of files or single file.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5
If False the offsets array will not be saved and will be derived from the string values
upon load/read.
file_type (str ("single" | "distribute")) – Default: Distribute
Distribute the dataset over a file per locale.
Single file will save the dataset to one file
Return type:
String message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group:
one for the string characters and one for the
segments corresponding to the start of each string
the hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must
have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Convert the SegString to a list, transferring data from the
arkouda server to Python. If the SegString exceeds a built-in size limit,
a RuntimeError is raised.
Returns:
A list with the same strings as this SegString
Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting ak.client.maxTransferBytes to a larger
value, but proceed with caution.
Convert the array to a np.ndarray, transferring array data from the
arkouda server to Python. If the array exceeds a built-in size limit,
a RuntimeError is raised.
Returns:
A numpy ndarray with the same strings as this array
Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting ak.client.maxTransferBytes to a larger
value, but proceed with caution.
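The size guard described in the notes above can be sketched as a simple check (illustrative helper; in arkouda the limit lives in ak.client.maxTransferBytes and the check runs before transfer):

```python
def check_transfer(nbytes, max_transfer_bytes):
    # Refuse transfers larger than the client-side limit, mirroring
    # the documented maxTransferBytes behavior.
    if nbytes > max_transfer_bytes:
        raise RuntimeError(
            f"transfer of {nbytes} bytes exceeds limit {max_transfer_bytes}")
    return nbytes
```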
Save the Strings object to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Sends a Strings object to a different Arkouda server
Parameters:
hostname (str) – The hostname where the Arkouda server intended to
receive the Strings object is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Overwrite the dataset with the name provided with this Strings object. If
the dataset does not exist it is added
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5
If False the offsets array will not be saved and will be derived from the string values
upon load/read.
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If the file does not contain a File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Represents an array of strings whose data resides on the
arkouda server. The user should not call this class directly;
rather its instances are created by other arkouda functions.
Strings is composed of two pdarrays: (1) offsets, which contains the
starting indices for each string and (2) bytes, which contains the
raw bytes of all strings, delimited by nulls.
>>> strings=ak.array([f'StrINgS aRe Here {i}'foriinrange(5)])>>> stringsarray(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3',... 'StrINgS aRe Here 4'])>>> strings.title()array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3',... 'Strings are here 4'])
Check whether each element contains the given substring.
Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that contain substr, False otherwise
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that end with substr, False otherwise
Unpack delimiter-joined substrings into a flat array.
Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring
in return array.
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring
in the return array
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings
object on the client side and transfer the offsets & bytes separately
to the server. This results in two entries in the symbol table and we
need to instruct the server to assemble the into a composite entity.
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings.
In the future we could probably use this position to store the total bytes.
Return the n-long prefix of each string, where possible
Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings
were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a prefix.
Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character prefix, False otherwise.
Return the n-long suffix of each string, where possible
Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings
were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a suffix.
Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character suffix, False otherwise.
Return the permutation that groups the array, placing equivalent
strings together. All instances of the same string are guaranteed to lie
in one contiguous block of the permuted array, but the blocks are not
necessarily ordered.
If the arkouda server is compiled with “-sSegmentedString.useHash=true”,
then arkouda uses 128-bit hash values to group strings, rather than sorting
the strings directly. This method is fast, but the resulting permutation
merely groups equivalent strings and does not sort them. If the “useHash”
parameter is false, then a full sort is performed.
Raises:
RuntimeError – Raised if there is a server-side error in executing group request or
creating the pdarray encapsulating the return message
The implementation uses SipHash128, a fast and balanced hash function (used
by Python for dictionaries and sets). For realistic numbers of strings (up
to about 10**15), the probability of a collision between two 128-bit hash
values is negligible.
Returns a boolean pdarray where index i indicates whether string i of the
Strings is alphabetic. This means there is at least one character,
and all the characters are alphabetic.
Returns:
True for elements that are alphabetic, False otherwise
Join the strings from another array onto the left of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
Peel off one or more delimited fields from each string (similar
to string.partition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over
the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return
array. By default, it is prepended to the beginning of the
second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the first array. By default,
such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The field(s) peeled from the end of each string (unless
fromRight is true)
right: Strings
The remainder of each string after peeling (unless fromRight
is true)
TypeError – Raised if the delimiter parameter is not byte or str_scalars, if
times is not int64, or if includeDelimiter, keepPartial, or
fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
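The default peel of a single field can be sketched in plain Python with `str.partition` (an illustration only; the names here are hypothetical and the real method runs server-side on whole arrays, with `includeDelimiter`, `keepPartial`, and `times` controlling the variations described above):

```python
def peel_one(s: str, delimiter: str) -> tuple[str, str]:
    """Peel the first delimited field off one string.

    Mirrors the default case (times=1): the peeled field goes to the
    first result, the remainder to the second; a string that does not
    contain the delimiter lands whole in the second result.
    """
    before, sep, after = s.partition(delimiter)
    if sep == '':
        return '', s  # no delimiter found: whole string goes right
    return before, after

left, right = peel_one('a.b.c', '.')
# left == 'a', right == 'b.c'
```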
Register this Strings object with a user defined name in the arkouda server
so it can be attached to later using Strings.attach()
This is an in-place operation, registering a Strings object more than once will
update the name in the registry and remove the previously registered name.
A name can only be registered to one object at a time.
Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
Returns:
The same Strings object which is now registered with the arkouda server and
has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different objects with the same name.
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name
If the user is attempting to register more than one object with the same name,
the former should be unregistered first to free up the registration name.
Peel off one or more delimited fields from the end of each string
(similar to string.rpartition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over
the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return
array. By default, it is appended to the end of the
second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the second array. By default,
such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The remainder of the string after peeling
right: Strings
The field(s) that were peeled from the right of each string
DEPRECATED
Save the Strings object to HDF5 or Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. HDF5 supports single files, in which case the file name will
be exactly the one provided. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: The name of the Strings dataset to be written, defaults to strings_array
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
Parameters:
save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5
If False the offsets array will not be saved and will be derived from the string values
upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If
‘Parquet’, the files will be written to the Parquet file format. This
is case insensitive.
file_type (str ("single" | "distribute")) – Default: "distribute".
"distribute" writes the dataset to one file per locale;
"single" saves the dataset to a single file.
Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets
within an hdf5 group: one for the string characters and one for the
segments corresponding to the start of each string, (2) the hdf5 group is named
via the dataset parameter. (3) Parquet files do not store the segments,
only the values.
Returns a match object with the first location in each element where pattern produces a match.
Elements match if any part of the string matches the regular expression pattern
Parameters:
pattern (str) – Regex used to find matches
Returns:
Match object where elements match if any part of the string matches the
regular expression pattern
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that start with substr, False otherwise
Join the strings from another array onto one end of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default,
other is joined to the right of self.
Returns a new Strings object with all leading and trailing occurrences of characters contained
in chars removed. The chars argument is a string specifying the set of characters to be removed.
If omitted, the chars argument defaults to removing whitespace. The chars argument is not a
prefix or suffix; rather, all combinations of its values are stripped.
Parameters:
chars – the set of characters to be removed
Returns:
Strings object with the leading and trailing characters matching the set of characters in
the chars argument removed
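Python's `str.strip` follows the same contract, which makes a handy client-side illustration of the "set of characters, not a prefix/suffix" rule:

```python
# chars is treated as a set: any combination of its characters is
# removed from both ends, stopping at the first character not in the set.
s = 'xxhelloyx'
stripped = s.strip('xy')
# → 'hello'
```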
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the
replacement repl.
If count is nonzero, at most count substitutions occur
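Python's `re.sub` has the same contract (non-overlapping matches; `count=0` means replace all, a nonzero `count` caps the substitutions), which illustrates the behavior described above:

```python
import re

# Only the first two numeric runs are replaced because count=2.
result = re.sub(r'\d+', '#', 'a1 b22 c333', count=2)
# → 'a# b# c333'
```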
Write Strings to CSV file(s). File will contain a single column with the Strings data.
All CSV Files written by Arkouda include a header denoting data types of the columns.
Unlike other file formats, CSV files store Strings in their UTF-8 format instead of storing
bytes as uint(8).
Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Return type:
str response message
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Save the Strings object to HDF5.
The object can be saved to a collection of files or single file.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5
If False the offsets array will not be saved and will be derived from the string values
upon load/read.
file_type (str ("single" | "distribute")) – Default: "distribute".
"distribute" writes the dataset to one file per locale;
"single" saves the dataset to a single file.
Return type:
String message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group:
one for the string characters and one for the segments
corresponding to the start of each string.
The hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must
have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Convert the SegString to a list, transferring data from the
arkouda server to Python. If the SegString exceeds a built-in size limit,
a RuntimeError is raised.
Returns:
A list with the same strings as this SegString
Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting ak.client.maxTransferBytes to a larger
value, but proceed with caution.
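The client-side guard described above can be sketched in plain Python (the function and parameter names here are illustrative stand-ins; the real limit lives at `ak.client.maxTransferBytes`):

```python
def check_transfer(strings, max_transfer_bytes):
    """Estimate the transfer payload before pulling it to the client,
    raising rather than risking local memory exhaustion."""
    # One byte per character (UTF-8 encoded) plus a null delimiter each.
    nbytes = sum(len(s.encode('utf-8')) + 1 for s in strings)
    if nbytes > max_transfer_bytes:
        raise RuntimeError(
            f'transfer of {nbytes} bytes exceeds limit {max_transfer_bytes}')
    return nbytes

check_transfer(['ab', 'cde'], 1024)  # → 7
```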
Convert the array to a np.ndarray, transferring array data from the
arkouda server to Python. If the array exceeds a built-in size limit,
a RuntimeError is raised.
Returns:
A numpy ndarray with the same strings as this array
Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting ak.client.maxTransferBytes to a larger
value, but proceed with caution.
Save the Strings object to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Sends a Strings object to a different Arkouda server
Parameters:
hostname (str) – The hostname where the Arkouda server intended to
receive the Strings object is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
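The port usage described above is simple arithmetic: numLocales consecutive ports starting at `port`. A quick sketch:

```python
# A 4-locale server with port=1234 uses ports 1234..1237 to send
# the array data, one port per locale.
num_locales = 4
port = 1234
ports = list(range(port, port + num_locales))
# → [1234, 1235, 1236, 1237]
```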
Overwrite the dataset with the name provided with this Strings object. If
the dataset does not exist it is added
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5
If False the offsets array will not be save and will be derived from the string values
upon load/read.
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If the file does not contain a File_Format attribute indicating how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
Represents an array of strings whose data resides on the
arkouda server. The user should not call this class directly;
rather its instances are created by other arkouda functions.
Strings is composed of two pdarrays: (1) offsets, which contains the
starting indices for each string and (2) bytes, which contains the
raw bytes of all strings, delimited by nulls.
>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3', 'StrINgS aRe Here 4'])
>>> strings.title()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3', 'Strings are here 4'])
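The two-pdarray layout (offsets plus null-delimited bytes) can be modeled client-side in plain Python (a simplified sketch; in arkouda both buffers live on the server):

```python
def to_segmented(strings):
    """Build the (offsets, bytes) pair: offsets[i] is where string i
    starts in the null-delimited byte buffer."""
    offsets, raw = [], bytearray()
    for s in strings:
        offsets.append(len(raw))
        raw.extend(s.encode('utf-8'))
        raw.append(0)  # null delimiter between strings
    return offsets, bytes(raw)

offsets, raw = to_segmented(['hi', 'there'])
# offsets == [0, 3]; raw == b'hi\x00there\x00'
```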
Check whether each element contains the given substring.
Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that contain substr, False otherwise
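With `regex=True`, "any part of the string matches" corresponds to `re.search` (not `re.fullmatch`); a client-side illustration of the element-wise result:

```python
import re

strings = ['item42', 'none', '7up']
mask = [re.search(r'\d+', s) is not None for s in strings]
# → [True, False, True]
```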
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that end with substr, False otherwise
Unpack delimiter-joined substrings into a flat array.
Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring
in return array.
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring
in the return array
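The split-with-segments behavior can be sketched in plain Python (the function name here is illustrative; the server-side version also supports regex delimiters):

```python
def flatten(strings, delimiter, return_segments=False):
    """Split every string on delimiter into one flat list, optionally
    recording, per original string, the index of its first substring."""
    flat, segments = [], []
    for s in strings:
        segments.append(len(flat))
        flat.extend(s.split(delimiter))
    return (flat, segments) if return_segments else flat

flat, seg = flatten(['a_b', 'c', 'd_e_f'], '_', return_segments=True)
# flat == ['a', 'b', 'c', 'd', 'e', 'f']; seg == [0, 2, 3]
```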
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings
object on the client side and transfer the offsets & bytes separately
to the server. This results in two entries in the symbol table and we
need to instruct the server to assemble them into a composite entity.
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings.
In the future we could probably use this position to store the total bytes.
Return the n-long prefix of each string, where possible
Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings
were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a prefix.
Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character prefix, False otherwise.
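The `proper` rule and the origin mask can be sketched client-side (an illustration only; the real method returns Strings and pdarray objects):

```python
def prefixes(strings, n, proper=True):
    """Return n-character prefixes plus a mask of which strings were
    long enough: length >= n+1 when proper, else >= n."""
    min_len = n + 1 if proper else n
    mask = [len(s) >= min_len for s in strings]
    pre = [s[:n] for s, ok in zip(strings, mask) if ok]
    return pre, mask

pre, mask = prefixes(['alpha', 'ab', 'abcd'], 2)
# pre == ['al', 'ab']; mask == [True, False, True]
```
The suffix case is symmetric, taking `s[-n:]` from strings that pass the same length test.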
Return the n-long suffix of each string, where possible
Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings
were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a suffix.
Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character suffix, False otherwise.
Return the permutation that groups the array, placing equivalent
strings together. All instances of the same string are guaranteed to lie
in one contiguous block of the permuted array, but the blocks are not
necessarily ordered.
If the arkouda server is compiled with “-sSegmentedString.useHash=true”,
then arkouda uses 128-bit hash values to group strings, rather than sorting
the strings directly. This method is fast, but the resulting permutation
merely groups equivalent strings and does not sort them. If the “useHash”
parameter is false, then a full sort is performed.
Raises:
RuntimeError – Raised if there is a server-side error in executing group request or
creating the pdarray encapsulating the return message
The implementation uses SipHash128, a fast and balanced hash function (used
by Python for dictionaries and sets). For realistic numbers of strings (up
to about 10**15), the probability of a collision between two 128-bit hash
values is negligible.
Returns a boolean pdarray where index i indicates whether string i of the
Strings is alphabetic. This means there is at least one character,
and all the characters are alphabetic.
Returns:
True for elements that are alphabetic, False otherwise
Join the strings from another array onto the left of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
Peel off one or more delimited fields from each string (similar
to string.partition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over
the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return
array. By default, it is prepended to the beginning of the
second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the first array. By default,
such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The field(s) peeled from the end of each string (unless
fromRight is true)
right: Strings
The remainder of each string after peeling (unless fromRight
is true)
TypeError – Raised if the delimiter parameter is not byte or str_scalars, if
times is not int64, or if includeDelimiter, keepPartial, or
fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
Register this Strings object with a user defined name in the arkouda server
so it can be attached to later using Strings.attach()
This is an in-place operation, registering a Strings object more than once will
update the name in the registry and remove the previously registered name.
A name can only be registered to one object at a time.
Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
Returns:
The same Strings object which is now registered with the arkouda server and
has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different objects with the same name.
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name
If the user is attempting to register more than one object with the same name,
the former should be unregistered first to free up the registration name.
Peel off one or more delimited fields from the end of each string
(similar to string.rpartition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over
the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return
array. By default, it is appended to the end of the
second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the second array. By default,
such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The remainder of the string after peeling
right: Strings
The field(s) that were peeled from the right of each string
DEPRECATED
Save the Strings object to HDF5 or Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. HDF5 support single files, in which case the file name will
only be that provided. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: The name of the Strings dataset to be written, defaults to strings_array
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
Parameters:
save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5
If False the offsets array will not be save and will be derived from the string values
upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If
‘Parquet’, the files will be written to the Parquet file format. This
is case insensitive.
file_type (str ("single" | "distribute")) – Default: Distribute
Distribute the dataset over a file per locale.
Single file will save the dataset to one file
Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets
within an hdf5 group: one for the string characters and one for the
segments corresponding to the start of each string, (2) the hdf5 group is named
via the dataset parameter. (3) Parquet files do not store the segments,
only the values.
Returns a match object with the first location in each element where pattern produces a match.
Elements match if any part of the string matches the regular expression pattern
Parameters:
pattern (str) – Regex used to find matches
Returns:
Match object where elements match if any part of the string matches the
regular expression pattern
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that start with substr, False otherwise
Join the strings from another array onto one end of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default,
other is joined to the right of self.
Returns a new Strings object with all leading and trailing occurrences of characters contained
in chars removed. The chars argument is a string specifying the set of characters to be removed.
If omitted, the chars argument defaults to removing whitespace. The chars argument is not a
prefix or suffix; rather, all combinations of its values are stripped.
Parameters:
chars – the set of characters to be removed
Returns:
Strings object with the leading and trailing characters matching the set of characters in
the chars argument removed
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the
replacement repl.
If count is nonzero, at most count substitutions occur
Write Strings to CSV file(s). File will contain a single column with the Strings data.
All CSV Files written by Arkouda include a header denoting data types of the columns.
Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing
bytes as uint(8).
Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Return type:
str reponse message
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Save the Strings object to HDF5.
The object can be saved to a collection of files or single file.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5
If False the offsets array will not be save and will be derived from the string values
upon load/read.
file_type (str ("single" | "distribute")) – Default: Distribute
Distribute the dataset over a file per locale.
Single file will save the dataset to one file
Return type:
String message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group:
one for the string characters and one for the
segments corresponding to the start of each string
the hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must
have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used.The file I/O does not rely on the extension to
determine the file format.
Convert the SegString to a list, transferring data from the
arkouda server to Python. If the SegString exceeds a built-in size limit,
a RuntimeError is raised.
Returns:
A list with the same strings as this SegString
Return type:
list
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting ak.client.maxTransferBytes to a larger
value, but proceed with caution.
Convert the array to a np.ndarray, transferring array data from the
arkouda server to Python. If the array exceeds a built-in size limit,
a RuntimeError is raised.
Returns:
A numpy ndarray with the same strings as this array
Return type:
np.ndarray
Notes
The number of bytes in the array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting ak.client.maxTransferBytes to a larger
value, but proceed with caution.
Save the Strings object to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
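The per-locale file naming described in the notes can be sketched in pure Python. Only the <prefix_path>_LOCALE<i> pattern is documented above; the exact digit formatting here is an assumption.

```python
def locale_filenames(prefix_path, num_locales):
    # One output file per locale, named <prefix_path>_LOCALE<i>,
    # with i ranging over the locales (assumed unpadded here).
    return [f"{prefix_path}_LOCALE{i}" for i in range(num_locales)]
```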
Sends a Strings object to a different Arkouda server
Parameters:
hostname (str) – The hostname where the Arkouda server intended to
receive the Strings object is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
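The port usage described above can be made concrete with a small sketch: a server of numLocales nodes uses numLocales consecutive ports starting at the given port, matching the 4-node example in the text.

```python
def transfer_ports(port, num_locales):
    # The docs state the transfer uses ports {port..(port+numLocales)}
    # in succession, one per locale.
    return list(range(port, port + num_locales))
```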
Overwrite the dataset with the name provided with this Strings object. If
the dataset does not exist it is added
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
save_offsets (bool) – Defaults to True, which will instruct the server to save the offsets array to HDF5.
If False, the offsets array will not be saved and will be derived from the string values
upon load/read.
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the Strings object
Notes
If the file does not contain a File_Format attribute indicating how it was saved,
the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added
Represents an array of strings whose data resides on the
arkouda server. The user should not call this class directly;
rather its instances are created by other arkouda functions.
Strings is composed of two pdarrays: (1) offsets, which contains the
starting indices for each string and (2) bytes, which contains the
raw bytes of all strings, delimited by nulls.
>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3', 'StrINgS aRe Here 4'])
>>> strings.title()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3', 'Strings are here 4'])
Check whether each element contains the given substring.
Parameters:
substr (str_scalars) – The substring in the form of string or byte array to search for
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that contain substr, False otherwise
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that end with substr, False otherwise
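The contains semantics above have a simple pure-Python analogue using re.search for the regex case (the server uses re2, which does not support lookarounds; Python's re module stands in here for illustration only).

```python
import re

def contains(strings, substr, regex=False):
    # Element-wise substring containment; with regex=True, any part of
    # the string matching the pattern counts as a hit.
    if regex:
        return [re.search(substr, s) is not None for s in strings]
    return [substr in s for s in strings]
```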
Unpack delimiter-joined substrings into a flat array.
Parameters:
delimiter (str) – Characters used to split strings into substrings
return_segments (bool) – If True, also return mapping of original strings to first substring
in return array.
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
Strings – Flattened substrings with delimiters removed
pdarray, int64 (optional) – For each original string, the index of first corresponding substring
in the return array
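The flatten behavior, including the optional segments mapping, can be sketched as a pure-Python analogue (a literal-delimiter sketch; the regex path and server-side details are omitted).

```python
def flatten(strings, delimiter, return_segments=False):
    # Split each string on delimiter into a flat list; segments records,
    # for each original string, the index of its first substring in the
    # flat result.
    flat, segments = [], []
    for s in strings:
        segments.append(len(flat))
        flat.extend(s.split(delimiter))
    return (flat, segments) if return_segments else flat
```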
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
This factory method is used when we construct the parts of a Strings
object on the client side and transfer the offsets & bytes separately
to the server. This results in two entries in the symbol table and we
need to instruct the server to assemble them into a composite entity.
RuntimeError – Raised if there’s an error converting a server-returned str-descriptor
Notes
We really don’t have an itemsize because these are variable length strings.
In the future we could probably use this position to store the total bytes.
Return the n-long prefix of each string, where possible
Parameters:
n (int) – Length of prefix
return_origins (bool) – If True, return a logical index indicating which strings
were long enough to return an n-prefix
proper (bool) – If True, only return proper prefixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a prefix.
Returns:
prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character prefix, False otherwise.
Return the n-long suffix of each string, where possible
Parameters:
n (int) – Length of suffix
return_origins (bool) – If True, return a logical index indicating which strings
were long enough to return an n-suffix
proper (bool) – If True, only return proper suffixes, i.e. from strings
that are at least n+1 long. If False, allow the entire
string to be returned as a suffix.
Returns:
suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of
True values in the returned mask.
origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return
an n-character suffix, False otherwise.
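The prefix logic above (and its suffix mirror) can be sketched in pure Python; this is an illustrative analogue, not Arkouda's implementation.

```python
def prefixes(strings, n, return_origins=True, proper=True):
    # Keep an n-character prefix only where the string is long enough:
    # strictly longer than n when proper=True, at least n otherwise.
    long_enough = [len(s) > n if proper else len(s) >= n for s in strings]
    pref = [s[:n] for s, ok in zip(strings, long_enough) if ok]
    return (pref, long_enough) if return_origins else pref
```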
Return the permutation that groups the array, placing equivalent
strings together. All instances of the same string are guaranteed to lie
in one contiguous block of the permuted array, but the blocks are not
necessarily ordered.
If the arkouda server is compiled with “-sSegmentedString.useHash=true”,
then arkouda uses 128-bit hash values to group strings, rather than sorting
the strings directly. This method is fast, but the resulting permutation
merely groups equivalent strings and does not sort them. If the “useHash”
parameter is false, then a full sort is performed.
Raises:
RuntimeError – Raised if there is a server-side error in executing group request or
creating the pdarray encapsulating the return message
The implementation uses SipHash128, a fast and balanced hash function (used
by Python for dictionaries and sets). For realistic numbers of strings (up
to about 10**15), the probability of a collision between two 128-bit hash
values is negligible.
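The hash-based grouping can be illustrated with a pure-Python analogue: ordering indices by a hash of each string puts equal strings in one contiguous block without sorting the strings themselves. The docs name SipHash128; md5 stands in here purely for illustration.

```python
import hashlib

def group(strings):
    # Return a permutation of indices ordered by each string's hash
    # digest, so equal strings land in one contiguous block, but the
    # blocks are not in sorted order.
    key = lambda i: hashlib.md5(strings[i].encode()).digest()
    return sorted(range(len(strings)), key=key)
```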
Returns a boolean pdarray where index i indicates whether string i of the
Strings is alphabetic. This means there is at least one character,
and all the characters are alphabetic.
Returns:
True for elements that are alphabetic, False otherwise
Join the strings from another array onto the left of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (Union[bytes,str_scalars]) – String inserted between self and other
Peel off one or more delimited fields from each string (similar
to string.partition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over
the first (times-1) delimiters
includeDelimiter (bool) – If true, append the delimiter to the end of the first return
array. By default, it is prepended to the beginning of the
second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the first array. By default,
such strings are returned in the second array.
fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The field(s) peeled from the end of each string (unless
fromRight is true)
right: Strings
The remainder of each string after peeling (unless fromRight
is true)
TypeError – Raised if the delimiter parameter is not byte or str_scalars, if
times is not int64, or if includeDelimiter, keepPartial, or
fromRight is not bool
ValueError – Raised if times is < 1 or if delimiter is not a valid regex
RuntimeError – Raised if there is a server-side error thrown
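The peel semantics above can be sketched as a simplified pure-Python analogue (literal delimiters only; the regex and fromRight paths are omitted for brevity).

```python
def peel(strings, delimiter, times=1, keepPartial=False, includeDelimiter=False):
    # Split each string at the <times>-th occurrence of delimiter,
    # returning the peeled field(s) and the remainder as two lists.
    left, right = [], []
    for s in strings:
        parts = s.split(delimiter)
        if len(parts) > times:
            l = delimiter.join(parts[:times])
            if includeDelimiter:
                l += delimiter  # append delimiter to the first array
            left.append(l)
            right.append(delimiter.join(parts[times:]))
        elif keepPartial:
            left.append(s)   # partial strings go to the first array
            right.append("")
        else:
            left.append("")  # by default, partial strings go to the second array
            right.append(s)
    return left, right
```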
Register this Strings object with a user defined name in the arkouda server
so it can be attached to later using Strings.attach()
This is an in-place operation; registering a Strings object more than once will
update the name in the registry and remove the previously registered name.
A name can only be registered to one object at a time.
Parameters:
user_defined_name (str) – user defined name which the Strings object is to be registered under
Returns:
The same Strings object which is now registered with the arkouda server and
has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different objects with the same name.
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the Strings object with the user_defined_name
If the user is attempting to register more than one object with the same name,
the former should be unregistered first to free up the registration name.
Peel off one or more delimited fields from the end of each string
(similar to string.rpartition), returning two new arrays of strings.
Warning: This function is experimental and not guaranteed to work.
Parameters:
delimiter (Union[bytes, str_scalars]) – The separator where the split will occur
times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over
the last (times-1) delimiters
includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return
array. By default, it is appended to the end of the
second return array.
keepPartial (bool) – If true, a string that does not contain <times> instances of
the delimiter will be returned in the second array. By default,
such strings are returned in the first array.
regex (bool) – Indicates whether delimiter is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
left: Strings
The remainder of the string after peeling
right: Strings
The field(s) that were peeled from the right of each string
DEPRECATED
Save the Strings object to HDF5 or Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. HDF5 supports single files, in which case the file name will
simply be the one provided. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: The name of the Strings dataset to be written, defaults to strings_array
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
Parameters:
save_offsets (bool) – Defaults to True, which will instruct the server to save the offsets array to HDF5.
If False, the offsets array will not be saved and will be derived from the string values
upon load/read. This is not supported for Parquet files.
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
file_format (str) – By default, saved files will be written to the HDF5 file format. If
‘Parquet’, the files will be written to the Parquet file format. This
is case insensitive.
file_type (str ("single" | "distribute")) – Default: Distribute
Distribute the dataset over a file per locale.
Single file will save the dataset to one file
Return type:
String message indicating result of save operation
Notes
Important implementation notes: (1) Strings state is saved as two datasets
within an hdf5 group: one for the string characters and one for the
segments corresponding to the start of each string; (2) the hdf5 group is named
via the dataset parameter; (3) Parquet files do not store the segments,
only the values.
Returns a match object with the first location in each element where pattern produces a match.
Elements match if any part of the string matches the regular expression pattern
Parameters:
pattern (str) – Regex used to find matches
Returns:
Match object where elements match if any part of the string matches the
regular expression pattern
regex (bool) – Indicates whether substr is a regular expression
Note: only handles regular expressions supported by re2
(does not support lookaheads/lookbehinds)
Returns:
True for elements that start with substr, False otherwise
Join the strings from another array onto one end of the strings
of this array, optionally inserting a delimiter.
Warning: This function is experimental and not guaranteed to work.
Parameters:
other (Strings) – The strings to join onto self’s strings
delimiter (str) – String inserted between self and other
toLeft (bool) – If true, join other strings to the left of self. By default,
other is joined to the right of self.
Returns a new Strings object with all leading and trailing occurrences of characters contained
in chars removed. The chars argument is a string specifying the set of characters to be removed.
If omitted, the chars argument defaults to removing whitespace. The chars argument is not a
prefix or suffix; rather, all combinations of its values are stripped.
Parameters:
chars – the set of characters to be removed
Returns:
Strings object with the leading and trailing characters matching the set of characters in
the chars argument removed
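Python's built-in str.strip has the same character-set semantics described above and serves as a pure-Python analogue (this sketch is illustrative, not Arkouda's implementation).

```python
def strip_chars(strings, chars=None):
    # chars is a set of characters stripped from both ends, not a
    # literal prefix/suffix; None falls back to whitespace, mirroring
    # the documented default.
    return [s.strip(chars) for s in strings]
```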
Return new Strings obtained by replacing non-overlapping occurrences of pattern with the
replacement repl.
If count is nonzero, at most count substitutions occur
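Python's re.sub provides the same non-overlapping replacement with a count cap, and serves as an analogue of the behavior described above.

```python
import re

def sub(strings, pattern, repl, count=0):
    # Replace non-overlapping occurrences of pattern with repl in each
    # element; a nonzero count caps substitutions per element.
    return [re.sub(pattern, repl, s, count=count) for s in strings]
```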
Write Strings to CSV file(s). File will contain a single column with the Strings data.
All CSV Files written by Arkouda include a header denoting data types of the columns.
Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing
bytes as uint(8).
Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Return type:
str response message
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
All CSV files must delimit rows using newline (\n) at this time.
Save the Strings object to HDF5.
The object can be saved to a collection of files or single file.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – The name of the Strings dataset to be written, defaults to strings_array
mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist.
If ‘append’, create a new Strings dataset within existing files.
save_offsets (bool) – Defaults to True, which will instruct the server to save the offsets array to HDF5.
If False, the offsets array will not be saved and will be derived from the string values
upon load/read.
file_type (str ("single" | "distribute")) – Default: Distribute
Distribute the dataset over a file per locale.
Single file will save the dataset to one file
Return type:
String message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
Parquet files do not store the segments, only the values.
Strings state is saved as two datasets within an hdf5 group:
one for the string characters and one for the
segments corresponding to the start of each string
the hdf5 group is named via the dataset parameter.
The prefix_path must be visible to the arkouda server and the user must
have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Represents a duration, the difference between two dates or times.
Timedelta is the Arkouda equivalent of pandas.TimedeltaIndex.
Parameters:
pda (int64 pdarray, pd.TimedeltaIndex, pd.Series, or np.timedelta64 array)
unit (str, default 'ns') –
For int64 pdarray, denotes the unit of the input. Ignored for pandas
and numpy arrays, which carry their own unit. Not case-sensitive;
prefixes of full names (like ‘sec’) are accepted.
Possible values:
’weeks’ or ‘w’
’days’ or ‘d’
’hours’ or ‘h’
’minutes’, ‘m’, or ‘t’
’seconds’ or ‘s’
’milliseconds’, ‘ms’, or ‘l’
’microseconds’, ‘us’, or ‘u’
’nanoseconds’, ‘ns’, or ‘n’
Unlike in pandas, units cannot be combined or mixed with integers
Notes
The .values attribute is always in nanoseconds with int64 dtype.
Register this Timedelta object and underlying components with the Arkouda server
Parameters:
user_defined_name (str) – user defined name the timedelta is to be registered under,
this will be the root name for underlying components
Returns:
The same Timedelta which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support
a fluid programming style.
Please note you cannot register two different Timedeltas with the same name.
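The unit handling described above (case-insensitive names, abbreviations, prefix matching, everything normalized to int64 nanoseconds) can be sketched in pure Python. The abbreviation table mirrors the docstring; the prefix-matching logic is an illustrative assumption.

```python
# Nanoseconds per unit, per the docstring's unit list.
_NS = {"weeks": 7 * 24 * 3600 * 10**9, "days": 24 * 3600 * 10**9,
       "hours": 3600 * 10**9, "minutes": 60 * 10**9, "seconds": 10**9,
       "milliseconds": 10**6, "microseconds": 10**3, "nanoseconds": 1}
# Documented abbreviations ('m' and 't' are minutes, 'l' is milliseconds, etc.)
_ABBREV = {"w": "weeks", "d": "days", "h": "hours", "m": "minutes",
           "t": "minutes", "s": "seconds", "ms": "milliseconds",
           "l": "milliseconds", "us": "microseconds", "u": "microseconds",
           "ns": "nanoseconds", "n": "nanoseconds"}

def to_nanoseconds(value, unit="ns"):
    u = unit.lower()  # not case-sensitive
    if u in _ABBREV:
        name = _ABBREV[u]
    else:
        # Accept unambiguous prefixes of full names, like 'sec'.
        matches = [k for k in _NS if k.startswith(u)]
        if len(matches) != 1:
            raise ValueError(f"unsupported unit {unit!r}")
        name = matches[0]
    return value * _NS[name]
```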
This is raised whenever the maximum number of candidate solutions
to consider specified by the max_work parameter is exceeded.
Assigning a finite number to max_work may have caused the operation
to fail.
The bool_ type is not a subclass of the int_ type
(the bool_ is not even a number type). This is different
than Python’s default implementation of bool as a
sub-class of int.
If a tuple, then the first element is interpreted as an attribute of
obj and the second as the docstring to apply - (method,docstring)
If a list, then each element of the list should be a tuple of length
two - [(method1,docstring1),(method2,docstring2),...]
warn_on_python (bool) – If True, the default, emit UserWarning if this is used to attach
documentation to a pure-python object.
Notes
This routine never raises an error if the docstring can’t be written, but
will raise an error if the object being documented does not exist.
This routine cannot modify read-only docstrings, as appear
in new-style classes or built-in functions. Because this
routine never raises an error the caller must check manually
that the docstrings were changed.
Since this function grabs the char* from a c-level str object and puts
it into the tp_doc slot of the type of obj, it violates a number of
C-API best-practices, by:
modifying a PyTypeObject after calling PyType_Ready
calling Py_INCREF on the str and losing the reference, so the str
will never be released
dt (np.dtype, type, or str) – The target dtype to cast values to
errors ({strict, ignore, return_validity}) –
Controls how errors are handled when casting strings to a numeric type
(ignored for casts from numeric types).
strict: raise RuntimeError if any string cannot be converted
ignore: never raise an error. Uninterpretable strings get
converted to NaN (float64), -2**63 (int64), zero (uint64 and
uint8), or False (bool)
return_validity: in addition to returning the same output as
“ignore”, also return a bool array indicating where the cast
was successful.
Returns:
pdarray or Strings – Array of values cast to desired dtype
[validity (pdarray(bool)]) – If errors=”return_validity” and input is Strings, a second array is
returned with True where the cast succeeded and False where it failed.
Notes
The cast is performed according to Chapel’s casting rules and is NOT safe
from overflows or underflows. The user must ensure that the target dtype
has the precision and capacity to hold the desired result.
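The error modes described above can be illustrated with a pure-Python analogue for the strings-to-int64 case. The -2**63 sentinel mirrors the int64 fill value named in the docs; the function itself is a sketch, not Arkouda's cast.

```python
def cast_int64(strings, errors="strict"):
    # strict: raise on any unconvertible string.
    # ignore: substitute the documented int64 fill value, -2**63.
    # return_validity: like ignore, but also return a bool mask of
    # where the cast succeeded.
    values, valid = [], []
    for s in strings:
        try:
            values.append(int(s))
            valid.append(True)
        except ValueError:
            if errors == "strict":
                raise RuntimeError(f"could not cast {s!r} to int64")
            values.append(-2**63)
            valid.append(False)
    if errors == "return_validity":
        return values, valid
    return values
```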
Return a pair of integers, whose ratio is exactly equal to the original
floating point number, and with a positive denominator.
Raise OverflowError on infinities and a ValueError on NaNs.
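Python's float.as_integer_ratio already implements this contract: an exact ratio with a positive denominator, OverflowError on infinities, and ValueError on NaNs.

```python
def integer_ratio(x):
    # Exact (numerator, denominator) pair for a float, denominator
    # always positive; errors propagate from float.as_integer_ratio.
    return float(x).as_integer_ratio()
```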
This represents a generic version of type ‘origin’ with type arguments ‘params’.
There are two kind of these aliases: user defined and special. The special ones
are wrappers around builtin collections and ABCs in collections.abc. These must
have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated;
this is used by e.g. typing.List and typing.Dict.
Create a pdarray of consecutive integers within the interval [start, stop).
If only one arg is given then arg is the stop parameter. If two args are
given, then the first arg is start and second is stop. If three args are
given, then the first arg is start, second is stop, third is stride.
The return value is cast to type dtype
Parameters:
start (int_scalars, optional) – Starting value (inclusive)
stride (int_scalars, optional) – The difference between consecutive elements, the default stride is 1,
if stride is specified then start must also be specified.
dtype (np.dtype, type, or str) – The target dtype to cast values to
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
Returns:
Integers from start (inclusive) to stop (exclusive) by stride
Negative strides result in decreasing values. Currently, only int64
pdarrays can be created with this method. For float64 arrays, use
the linspace method.
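The positional-argument handling described above can be sketched in pure Python: one argument is stop, two are (start, stop), three are (start, stop, stride). This mirrors the docstring; it is not Arkouda's implementation.

```python
def arange_args(*args):
    # Normalize arange-style positional arguments to (start, stop, stride).
    if len(args) == 1:
        return 0, args[0], 1
    if len(args) == 2:
        return args[0], args[1], 1
    if len(args) == 3:
        return args
    raise TypeError("arange takes 1 to 3 positional arguments")
```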
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the inverse cosine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing inverse cosine for each element
of the original pdarray
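The where parameter behaves the same way across the trigonometric functions below: the function is applied where the condition is True, and the original value is retained elsewhere. A pure-Python, elementwise sketch of that semantics (not the Arkouda implementation):

```python
import math

def apply_where(values, func, where=True):
    """Apply func where the mask is True; retain original values elsewhere.
    A scalar boolean condition is broadcast over all elements."""
    if isinstance(where, bool):
        where = [where] * len(values)
    return [func(v) if w else v for v, w in zip(values, where)]

vals = [1.0, 0.0, -1.0, 0.5]
mask = [True, True, False, True]
out = apply_where(vals, math.acos, where=mask)
print(out[2])  # -1.0: retained unchanged, arccos not applied there
```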
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the inverse hyperbolic cosine will be applied to the corresponding value. Elsewhere, it will
retain its original value. Default set to True.
Returns:
A pdarray containing inverse hyperbolic cosine for each element
of the original pdarray
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the inverse sine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing inverse sine for each element
of the original pdarray
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the inverse hyperbolic sine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing inverse hyperbolic sine for each element
of the original pdarray
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the inverse tangent will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing inverse tangent for each element
of the original pdarray
Return the element-wise inverse tangent of the array pair. The result chosen is the
signed angle in radians between the ray ending at the origin and passing through the
point (1,0), and the ray ending at the origin and passing through the point (denom, num).
The result is between -pi and pi.
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the inverse tangent will be applied to the corresponding values. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing the inverse tangent of each corresponding element pair
of the original pdarrays, using the signed values of the numerator and
denominator to get proper placement on the unit circle.
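The signed-angle convention described above matches Python's math.atan2(num, denom):

```python
import math

# atan2(num, denom) gives the signed angle in (-pi, pi] between the
# positive x-axis and the ray through the point (denom, num).
print(math.atan2(1, 1))    # pi/4   (first quadrant)
print(math.atan2(1, -1))   # 3*pi/4 (second quadrant)
print(math.atan2(-1, -1))  # -3*pi/4 (third quadrant)
```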
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the inverse hyperbolic tangent will be applied to the corresponding value. Elsewhere,
it will retain its original value. Default set to True.
Returns:
A pdarray containing inverse hyperbolic tangent for each element
of the original pdarray
TypeError – Raised if pda is not a pdarray or k is not an integer
ValueError – Raised if the pda is empty or k < 1
Notes
This call is equivalent in value to:
ak.argsort(a)[k:]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows
beyond a certain value. This value is system dependent, but generally
about a k of 5 million is where performance degradation has been observed.
TypeError – Raised if pda is not a pdarray or k is not an integer
ValueError – Raised if the pda is empty or k < 1
Notes
This call is equivalent in value to:
ak.argsort(a)[:k]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows
beyond a certain value. This value is system dependent, but generally
about a k of 5 million is where performance degradation has been observed.
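The argsort-slice equivalence for the k smallest values can be illustrated in pure Python (a sketch of the semantics, not the distributed Arkouda reduction):

```python
def argsort(a):
    """Indices that would sort a (stable)."""
    return sorted(range(len(a)), key=a.__getitem__)

a = [7, 1, 9, 3, 5]
k = 2
smallest_k = argsort(a)[:k]        # indices of the k smallest values
print(smallest_k)                  # [1, 3]
print([a[i] for i in smallest_k])  # [1, 3]
```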
TypeError – Raised if a is not a pdarray, np.ndarray, or Python Iterable such as a
list, array, tuple, or deque
RuntimeError – Raised if a is not one-dimensional, nbytes > maxTransferBytes, a.dtype is
not supported (not in DTypes), or if the product of a.size and
a.itemsize > maxTransferBytes
ValueError – Raised if the returned message is malformed or does not contain the fields
required to generate the array.
The number of bytes in the input array cannot exceed ak.client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overwhelming the connection between the Python client and the arkouda
server, under the assumption that it is a low-bandwidth connection. The user
may override this limit by setting ak.client.maxTransferBytes to a larger value,
but should proceed with caution.
If the pdarray or ndarray is of type U, this method is called twice recursively
to create the Strings object and the two corresponding pdarrays for string
bytes and offsets, respectively.
Compares two pdarrays for equality.
If neither array has any nan elements, then if all elements are pairwise equal,
it returns True.
If equal_nan is False, then any nan element in either array gives a False return.
If equal_nan is True, then pairwise-corresponding nans are considered equal.
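The equal_nan semantics can be sketched elementwise in pure Python (an illustration of the rule, not the Arkouda implementation):

```python
import math

def arrays_equal(a, b, equal_nan=False):
    """Pairwise equality; NaN matches NaN only when equal_nan is True."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if math.isnan(x) and math.isnan(y):
            if not equal_nan:
                return False
        elif x != y:
            return False
    return True

a = [1.0, float("nan"), 3.0]
print(arrays_equal(a, a))                  # False: nan != nan
print(arrays_equal(a, a, equal_nan=True))  # True
```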
check_dtype (bool, default True) – Check that integer dtype of the codes are the same.
check_category_order (bool, default True) – Whether the order of the categories should be compared, which
implies identical integer codes. If False, only the resulting
values are compared. The ordered attribute is
checked regardless.
obj (str, default 'Categorical') – Specify object name being compared, internally used to show appropriate
assertion message.
Assert that two dictionaries are equal.
Values must be arkouda objects.
:param left: The first dictionary to be compared.
:type left: dict
:param right: The second dictionary to be compared.
:type right: dict
:param compare_keys: Whether to compare the keys.
This function is intended to compare two DataFrames and output any
differences. It is mostly intended for use in unit tests.
Additional parameters allow varying the strictness of the
equality checks performed.
check_dtype (bool, default True) – Whether to check the DataFrame dtype is identical.
check_index_type (bool, default = True) – Whether to check the Index class, dtype and inferred_type
are identical.
check_column_type (bool or {'equiv'}, default 'equiv') – Whether to check the columns class, dtype and inferred_type
are identical. Is passed as the exact argument of
assert_index_equal().
check_frame_type (bool, default True) – Whether to check the DataFrame class is identical.
check_names (bool, default True) – Whether to check that the names attribute for both the index
and column attributes of the DataFrame is identical.
check_exact (bool, default False) – Whether to compare number exactly.
check_like (bool, default False) – If True, ignore the order of index & columns.
Note: index labels must match their respective rows
(same as in columns) - same labels must be with the same data.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'DataFrame') – Specify object name being compared, internally used to show appropriate
assertion message.
This function is intended to compare two DataFrames and output any
differences. It is mostly intended for use in unit tests.
Additional parameters allow varying the strictness of the
equality checks performed.
pd.DataFrame’s will be converted to the arkouda equivalent.
Then assert_frame_equal will be applied to the result.
Parameters:
left (DataFrame or pd.DataFrame) – First DataFrame to compare.
right (DataFrame or pd.DataFrame) – Second DataFrame to compare.
check_dtype (bool, default True) – Whether to check the DataFrame dtype is identical.
check_index_type (bool, default = True) – Whether to check the Index class, dtype and inferred_type
are identical.
check_column_type (bool or {'equiv'}, default 'equiv') – Whether to check the columns class, dtype and inferred_type
are identical. Is passed as the exact argument of
assert_index_equal().
check_frame_type (bool, default True) – Whether to check the DataFrame class is identical.
check_names (bool, default True) – Whether to check that the names attribute for both the index
and column attributes of the DataFrame is identical.
check_exact (bool, default False) – Whether to compare number exactly.
check_like (bool, default False) – If True, ignore the order of index & columns.
Note: index labels must match their respective rows
(same as in columns) - same labels must be with the same data.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'DataFrame') – Specify object name being compared, internally used to show appropriate
assertion message.
check_order (bool, default True) – Whether to compare the order of index entries as well as their values.
If True, both indexes must contain the same elements, in the same order.
If False, both indexes must contain the same elements, but in any order.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'Index') – Specify object name being compared, internally used to show appropriate
assertion message.
check_order (bool, default True) – Whether to compare the order of index entries as well as their values.
If True, both indexes must contain the same elements, in the same order.
If False, both indexes must contain the same elements, but in any order.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'Index') – Specify object name being compared, internally used to show appropriate
assertion message.
check_category_order (bool, default True) – Whether to compare category order of internal Categoricals.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'Series') – Specify object name being compared, internally used to show appropriate
assertion message.
check_index (bool, default True) – Whether to check index equivalence. If False, then compare only values.
check_like (bool, default False) – If True, ignore the order of the index. Must be False if check_index is False.
Note: same labels must be with the same data.
check_category_order (bool, default True) – Whether to compare category order of internal Categoricals.
rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.
atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.
obj (str, default 'Series') – Specify object name being compared, internally used to show appropriate
assertion message.
check_index (bool, default True) – Whether to check index equivalence. If False, then compare only values.
check_like (bool, default False) – If True, ignore the order of the index. Must be False if check_index is False.
Note: same labels must be with the same data.
Registered names/pdarrays in the server are immune to deletion
until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.attach_pdarray("my_zeros")
>>> # ...other work...
>>> b.unregister()
Convert a number or string to an integer, or return 0 if no arguments
are given. If x is a number, return x.__int__(). For floating point
numbers, this truncates towards zero.
If x is not a number or if base is given, then x must be a string,
bytes, or bytearray instance representing an integer literal in the
given base. The literal can be preceded by ‘+’ or ‘-’ and be surrounded
by whitespace. The base defaults to 10. Valid bases are 0 and 2-36.
Base 0 means to interpret the base from the string as an integer literal.
>>> int('0b100', base=0)
4
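A few more illustrations of the base handling described above, using the built-in int:

```python
# Base 10 by default; surrounding whitespace and a sign are allowed.
print(int("  -42  "))        # -42
# Explicit base, valid range 2-36.
print(int("ff", 16))         # 255
# Base 0: infer the base from the literal's prefix.
print(int("0b100", base=0))  # 4
print(int("0o17", base=0))   # 15
```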
Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to ‘strict’.
Create a bigint pdarray from an iterable of uint pdarrays.
The first item in arrays will be the highest 64 bits and
the last item will be the lowest 64 bits.
Parameters:
arrays (Sequence[pdarray]) – An iterable of uint pdarrays used to construct the bigint pdarray.
The first item in arrays will be the highest 64 bits and
the last item will be the lowest 64 bits.
max_bits (int) – Specifies the maximum number of bits; only used for bigint pdarrays
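The limb ordering described above (first array holds the highest 64 bits, last array the lowest) can be sketched for a single value in pure Python; this is an illustration of the semantics, not the Arkouda implementation:

```python
def bigint_from_uint_limbs(limbs, max_bits=-1):
    """Combine 64-bit limbs, most significant first, into one Python int.
    If max_bits > 0, keep only the low max_bits bits."""
    value = 0
    for limb in limbs:
        value = (value << 64) | limb
    if max_bits > 0:
        value &= (1 << max_bits) - 1
    return value

# 1 in the high limb, 2 in the low limb -> 2**64 + 2
print(bigint_from_uint_limbs([1, 2]))  # 18446744073709551618
```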
Return the binary representation of the input number as a string.
For negative numbers, if width is not given, a minus sign is added to the
front. If width is given, the two’s complement of the number is
returned, with respect to that width.
In a two’s-complement system negative numbers are represented by the two’s
complement of the absolute value. This is the most common method of
representing signed integers on computers [1]_. An N-bit two’s-complement
system can represent every integer in the range
\(-2^{N-1}\) to \(+2^{N-1}-1\).
Parameters:
num (int) – Only an integer decimal number can be used.
width (int, optional) –
The length of the returned string if num is positive, or the length
of the two’s complement if num is negative, provided that width is
at least a sufficient number of bits for num to be represented in the
designated form.
If the width value is insufficient, it will be ignored, and num will
be returned in binary (num > 0) or two’s complement (num < 0) form
with its width equal to the minimum number of bits needed to represent
the number in the designated form. This behavior is deprecated and will
later raise an error.
Deprecated since version 1.12.0.
Returns:
bin – Binary representation of num or two’s complement of num.
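A pure-Python sketch of the sign and width rules described above (it omits the deprecated insufficient-width fallback):

```python
def binary_repr(num, width=None):
    """Binary string for num; two's complement w.r.t. width when negative."""
    if num >= 0:
        s = bin(num)[2:]
        return s.zfill(width) if width else s
    if width is None:
        return "-" + bin(-num)[2:]   # minus sign, no width given
    # Two's complement with respect to the given width.
    return bin((1 << width) + num)[2:]

print(binary_repr(3))      # '11'
print(binary_repr(-3))     # '-11'
print(binary_repr(-3, 8))  # '11111101'
print(binary_repr(5, 8))   # '00000101'
```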
The bool_ type is not a subclass of the int_ type
(the bool_ is not even a number type). This is different
than Python’s default implementation of bool as a
sub-class of int.
Broadcast a dense column vector to the rows of a sparse matrix or grouped array.
Parameters:
segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array.
Must be sorted in ascending order.
values (pdarray, Strings) – The values to broadcast, one per row (or group)
size (int) – The total number of nonzeros in the matrix. If permutation is given, this
argument is ignored and the size is inferred from the permutation array.
permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering
grouped by row. To broadcast values back to the original ordering, this
permutation will be inverted. If no permutation is supplied, it is assumed
that the original nonzeros were already grouped by row. In this case, the
size argument must be given.
If the number of nonzeros (either user-specified or inferred from the permutation)
is less than one
Examples
>>> # Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
>>> # Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
>>> # If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
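The grouped (no-permutation) case of this broadcast can be sketched in pure Python with segment offsets; this illustrates the semantics only, not the distributed Arkouda implementation:

```python
def broadcast_segments(segments, values, size):
    """Expand one value per segment to every element of that segment.
    segments holds the offset of the start of each segment."""
    out = []
    bounds = list(segments) + [size]
    for i, val in enumerate(values):
        out.extend([val] * (bounds[i + 1] - bounds[i]))
    return out

row_starts = [0, 2, 5]
row_number = [0, 1, 2]
print(broadcast_segments(row_starts, row_number, 7))  # [0, 0, 1, 1, 1, 2, 2]
```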
dt (np.dtype, type, or str) – The target dtype to cast values to
errors ({strict, ignore, return_validity}) –
Controls how errors are handled when casting strings to a numeric type
(ignored for casts from numeric types).
strict: raise RuntimeError if any string cannot be converted
ignore: never raise an error. Uninterpretable strings get
converted to NaN (float64), -2**63 (int64), zero (uint64 and
uint8), or False (bool)
return_validity: in addition to returning the same output as
“ignore”, also return a bool array indicating where the cast
was successful.
Returns:
pdarray or Strings – Array of values cast to desired dtype
validity (pdarray, bool) – If errors=”return_validity” and the input is Strings, a second array is
returned with True where the cast succeeded and False where it failed.
Notes
The cast is performed according to Chapel’s casting rules and is NOT safe
from overflows or underflows. The user must ensure that the target dtype
has the precision and capacity to hold the desired result.
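The three error modes for a string-to-numeric cast can be sketched in pure Python; SENTINEL below plays the role of the -2**63 int64 fill value mentioned above (an illustration of the semantics, not the Arkouda implementation):

```python
def cast_strings_to_int(strings, errors="strict"):
    """Sketch of the strict / ignore / return_validity error modes."""
    SENTINEL = -2**63  # stand-in for the int64 fill value
    out, valid = [], []
    for s in strings:
        try:
            out.append(int(s))
            valid.append(True)
        except ValueError:
            if errors == "strict":
                raise RuntimeError(f"could not cast {s!r}")
            out.append(SENTINEL)
            valid.append(False)
    if errors == "return_validity":
        return out, valid
    return out

vals, ok = cast_strings_to_int(["1", "x", "3"], errors="return_validity")
print(vals)  # [1, -9223372036854775808, 3]
print(ok)    # [True, False, True]
```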
>>> a = ak.array([1,2,3,4,5,6,7,8,9,10])
>>> ak.clip(a, 3, 8)
array([3,3,3,4,5,6,7,8,8,8])
>>> ak.clip(a, 3, 8.0)
array([3.00000000000000000 3.00000000000000000 3.00000000000000000 4.00000000000000000 5.00000000000000000 6.00000000000000000 7.00000000000000000 8.00000000000000000 8.00000000000000000 8.00000000000000000])
>>> ak.clip(a, None, 7)
array([1,2,3,4,5,6,7,7,7,7])
>>> ak.clip(a, 5, None)
array([5,5,5,5,5,6,7,8,9,10])
>>> ak.clip(a, None, None)
ValueError: either min or max must be supplied
>>> ak.clip(a, ak.array([2,2,3,3,8,8,5,5,6,6]), 8)
array([2,2,3,4,8,8,7,8,8,8])
>>> ak.clip(a, 4, ak.array([10,9,8,7,6,5,5,5,5,5]))
array([4,4,4,4,5,5,5,5,5,5])
Notes
Either lo or hi may be None, but not both.
If lo > hi, all x = hi.
If all inputs are int64, output is int64, but if any input is float64, output is float64.
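The clipping rules above, including the lo > hi case and per-element bounds, can be sketched in pure Python (an illustration of the semantics, not the Arkouda implementation):

```python
def clip(a, lo=None, hi=None):
    """Clamp each element to [lo, hi]; either bound may be None, not both.
    Bounds may be scalars or per-element sequences."""
    if lo is None and hi is None:
        raise ValueError("either min or max must be supplied")
    n = len(a)
    lo = [lo] * n if lo is None or not hasattr(lo, "__len__") else lo
    hi = [hi] * n if hi is None or not hasattr(hi, "__len__") else hi
    out = []
    for x, l, h in zip(a, lo, hi):
        if l is not None and x < l:
            x = l
        if h is not None and x > h:
            x = h   # hi applied last, so lo > hi yields hi everywhere
        out.append(x)
    return out

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(clip(a, 3, 8))     # [3, 3, 3, 4, 5, 6, 7, 8, 8, 8]
print(clip(a, None, 7))  # [1, 2, 3, 4, 5, 6, 7, 7, 7, 7]
```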
Return the permutation that groups the rows (left-to-right), if the
input arrays are treated as columns. The permutation sorts numeric
columns, but strings and Categoricals are only grouped, not ordered.
Parameters:
arrays (Sequence[Union[Strings, pdarray, Categorical]]) – The columns (int64, uint64, float64, Strings, or Categorical) to sort by row
Returns:
The indices that permute the rows to grouped order
Uses a least-significant-digit radix sort, which is stable and resilient
to non-uniformity in data but communication intensive. Starts with the
last array and moves forward. This sort operates directly on numeric types,
but for Strings, it operates on a hash. Thus, while grouping of equivalent
strings is guaranteed, lexicographic ordering of the groups is not. For Categoricals,
coargsort sorts based on Categorical.codes which guarantees grouping of equivalent categories
but not lexicographic ordering of those groups.
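The least-significant-digit strategy described above (start with the last array, move forward, relying on sort stability) can be sketched with repeated stable sorts in pure Python; this illustrates the approach, not the distributed Arkouda radix sort:

```python
def coargsort(columns):
    """Permutation grouping rows lexicographically by the given columns.
    Sort by the last (least significant) column first; Python's sort is
    stable, so each earlier column's sort preserves later-column order."""
    n = len(columns[0])
    perm = list(range(n))
    for col in reversed(columns):
        perm.sort(key=lambda i: col[i])
    return perm

a = [1, 2, 1, 2]
b = [30, 10, 20, 40]
perm = coargsort([a, b])
print([(a[i], b[i]) for i in perm])  # [(1, 20), (1, 30), (2, 10), (2, 40)]
```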
ordered (bool) – If True (default), the arrays will be appended in the
order given. If False, array data may be interleaved
in blocks, which can greatly improve performance but
results in non-deterministic ordering of elements.
Returns:
Single pdarray or Strings object containing all values, returned in
the original order
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the cosine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing cosine for each element
of the original pdarray
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the hyperbolic cosine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing hyperbolic cosine for each element
of the original pdarray
Creates a fixed frequency Datetime range. Alias for
ak.Datetime(pd.date_range(args)). Subject to size limit
imposed by client.maxTransferBytes.
Parameters:
start (str or datetime-like, optional) – Left bound for generating dates.
end (str or datetime-like, optional) – Right bound for generating dates.
periods (int, optional) – Number of periods to generate.
freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’. See
timeseries.offset_aliases for a list of
frequency aliases.
tz (str or tzinfo, optional) – Time zone name for returning localized DatetimeIndex, for example
‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is
timezone-naive.
normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.
name (str, default None) – Name of the resulting DatetimeIndex.
closed ({None, 'left', 'right'}, optional) – Make the interval closed with respect to the given frequency to
the ‘left’, ‘right’, or both sides (None, the default).
Deprecated
inclusive ({"both", "neither", "left", "right"}, default "both") – Include boundaries. Whether to set each bound as closed or open.
**kwargs – For compatibility. Has no effect on the result.
Returns:
rng
Return type:
DatetimeIndex
Notes
Of the four parameters start, end, periods, and freq,
exactly three must be specified. If freq is omitted, the resulting
DatetimeIndex will have periods linearly spaced elements between
start and end (closed on both sides).
To learn more about the frequency strings, please see this link.
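Since this function is documented above as an alias for ak.Datetime(pd.date_range(...)), the pandas call alone shows the parameter semantics; the dates below are arbitrary:

```python
import pandas as pd

# start + periods + freq: exactly three of the four range parameters
rng = pd.date_range(start="2021-01-01", periods=3, freq="D")
# DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03'], freq='D')
```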
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be converted from degrees to radians. Elsewhere, it will retain its
original value. Default set to True.
Returns:
A pdarray containing an angle converted to radians, from degrees, for each element
of the original pdarray
obj (Union[pdarray, slice, int]) – The indices to remove from ‘arr’. If obj is a pdarray, it must
have an integer dtype.
axis (Optional[int], optional) – The axis along which to remove elements. If None, the array will
be flattened before removing elements. Defaults to None.
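A sketch of the obj/axis semantics, assuming the Arkouda function follows np.delete:

```python
import numpy as np

a = np.arange(6)                       # [0 1 2 3 4 5]
by_index = np.delete(a, [1, 3])        # remove indices 1 and 3 -> [0 2 4 5]
by_slice = np.delete(a, slice(0, 2))   # obj may also be a slice -> [2 3 4 5]

m = np.arange(6).reshape(2, 3)
no_row = np.delete(m, 0, axis=0)       # drop the first row; axis=None flattens first
```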
Issues a DeprecationWarning, adds warning to old_name’s
docstring, rebinds old_name.__name__ and returns the new
function object.
This function may also be used as a decorator.
Parameters:
func (function) – The function to be deprecated.
old_name (str, optional) – The name of the function to be deprecated. Default is None, in
which case the name of func is used.
new_name (str, optional) – The new name for the function. Default is None, in which case the
deprecation message is that old_name is deprecated. If given, the
deprecation message is that old_name is deprecated and new_name
should be used instead.
message (str, optional) – Additional explanation of the deprecation. Displayed in the
docstring after the warning.
Returns:
old_func – The deprecated function.
Return type:
function
Examples
Note that olduint returns a value after printing Deprecation
Warning:
>>> olduint = np.deprecate(np.uint)
DeprecationWarning: `uint64` is deprecated! # may vary
>>> olduint(6)
6
Deprecates a function and includes the deprecation in its docstring.
This function is used as a decorator. It returns an object that can be
used to issue a DeprecationWarning, by passing the to-be decorated
function as argument, this adds warning to the to-be decorated function’s
docstring and returns the new function object.
device (object) – Device to write message. If None, defaults to sys.stdout which is
very similar to print. device needs to have write() and
flush() methods.
linefeed (bool, optional) – Option whether to print a line feed or not. Defaults to True.
Raises:
AttributeError – If device does not have a write() or flush() method.
Examples
Besides sys.stdout, a file-like object can also be used as it has
both required methods:
>>> from io import StringIO
>>> buf = StringIO()
>>> np.disp(u'"Display" in a file', device=buf)
>>> buf.getvalue()
'"Display" in a file\n'
x (numeric_scalars(float_scalars, int_scalars) or pdarray) – The dividend array, the values that will be the numerator of the floordivision and will be
acted on by the bases for modular division.
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be divided using floor and modular division. Elsewhere, it will retain
its original value. Default set to True.
Returns:
Returns a tuple that contains quotient and remainder of the division
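The quotient/remainder pair can be sketched with NumPy's divmod, which applies floor and modular division elementwise (assuming the Arkouda function follows the same convention):

```python
import numpy as np

x = np.array([7, -7])
y = np.array([3, 3])
q, r = np.divmod(x, y)
# q = [2, -3], r = [1, 2]: floor division, so -7 // 3 == -3 and -7 % 3 == 2,
# and the identity x == q*y + r holds elementwise
```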
Return a pair of integers, whose ratio is exactly equal to the original
floating point number, and with a positive denominator.
Raise OverflowError on infinities and a ValueError on NaNs.
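A short illustration of the integer-ratio behavior with Python floats:

```python
# Exactly representable values give an exact, reduced ratio
exact = (0.25).as_integer_ratio()    # (1, 4)
# 0.1 is not exactly 1/10 in binary; the ratio reflects the stored value
stored = (0.1).as_integer_ratio()    # (3602879701896397, 36028797018963968)
# The denominator is always positive; the sign rides on the numerator
neg = (-0.5).as_integer_ratio()      # (-1, 2)
```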
diag (int_scalars) – if diag = 0, ones start at element [0,0] and proceed along the diagonal
if diag > 0, ones start at element [0,diag] and proceed along the diagonal
if diag < 0, ones start at element [diag,0] and proceed along the diagonal
Returns:
an array of zeros with ones along the specified diagonal
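The diag offset can be sketched with np.eye's k parameter, assuming the Arkouda function mirrors it:

```python
import numpy as np

main = np.eye(3)          # ones along [0,0], [1,1], [2,2]
upper = np.eye(3, k=1)    # ones start at [0,1] and proceed along the diagonal
lower = np.eye(3, k=-1)   # ones start at [1,0] and proceed along the diagonal
```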
Returns the dtype for which finfo returns information. For complex
input, the returned dtype is the associated float* dtype for its
real and complex components.
The difference between 1.0 and the next smallest representable float
larger than 1.0. For example, for 64-bit binary floats in the IEEE-754
standard, eps=2**-52, approximately 2.22e-16.
The difference between 1.0 and the next smallest representable float
less than 1.0. For example, for 64-bit binary floats in the IEEE-754
standard, epsneg=2**-53, approximately 1.11e-16.
The distance between a value and the nearest adjacent number
nextafter
The next floating point value after x1 towards x2
Notes
For developers of NumPy: do not instantiate this at the module level.
The initial calculation of these parameters is expensive and negatively
impacts import times. These objects are cached, so calling finfo()
repeatedly inside your functions is not a problem.
Note that smallest_normal is not actually the smallest positive
representable value in a NumPy floating point type. As in the IEEE-754
standard [1]_, NumPy floating point types make use of subnormal numbers to
fill the gap between 0 and smallest_normal. However, subnormal numbers
may have significantly reduced precision [2].
This function can also be used for complex data types as well. If used,
the output will be the same as the corresponding real float type
(e.g. numpy.finfo(numpy.csingle) is the same as numpy.finfo(numpy.single)).
However, the output is true for the real and imaginary components.
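The quantities described above can be inspected directly:

```python
import numpy as np

fi = np.finfo(np.float64)
fi.eps      # 2**-52, the gap between 1.0 and the next float above it
fi.epsneg   # 2**-53, the gap between 1.0 and the next float below it
# smallest_normal is not the smallest positive value: subnormals go lower.
# For complex input, finfo reports on the component float type:
component = np.finfo(np.complex64).dtype   # dtype('float32')
```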
This represents a generic version of type ‘origin’ with type arguments ‘params’.
There are two kind of these aliases: user defined and special. The special ones
are wrappers around builtin collections and ABCs in collections.abc. These must
have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated,
this is used by e.g. typing.List and typing.Dict.
Format a floating-point scalar as a decimal string in positional notation.
Provides control over rounding, trimming and padding. Uses and assumes
IEEE unbiased rounding. Uses the “Dragon4” algorithm.
Parameters:
x (python float or numpy floating scalar) – Value to format.
precision (non-negative integer or None, optional) – Maximum number of digits to print. May be None if unique is
True, but must be an integer if unique is False.
unique (boolean, optional) – If True, use a digit-generation strategy which gives the shortest
representation which uniquely identifies the floating-point number from
other values of the same type, by judicious rounding. If precision
is given fewer digits than necessary can be printed, or if min_digits
is given more can be printed, in which cases the last digit is rounded
with unbiased rounding.
If False, digits are generated as if printing an infinite-precision
value and stopping after precision digits, rounding the remaining
value with unbiased rounding
fractional (boolean, optional) – If True, the cutoffs of precision and min_digits refer to the
total number of digits after the decimal point, including leading
zeros.
If False, precision and min_digits refer to the total number of
significant digits, before or after the decimal point, ignoring leading
zeros.
trim (one of 'k', '.', '0', '-', optional) –
Controls post-processing trimming of trailing digits, as follows:
’k’ : keep trailing zeros, keep decimal point (no trimming)
’.’ : trim all trailing zeros, leave decimal point
’0’ : trim all but the zero before the decimal point. Insert the
zero if it is missing.
’-’ : trim trailing zeros and any trailing decimal point
sign (boolean, optional) – Whether to show the sign for positive values.
pad_left (non-negative integer, optional) – Pad the left side of the string with whitespace until at least that
many characters are to the left of the decimal point.
pad_right (non-negative integer, optional) – Pad the right side of the string with whitespace until at least that
many characters are to the right of the decimal point.
min_digits (non-negative integer or None, optional) –
Minimum number of digits to print. Only has an effect if unique=True
in which case additional digits past those necessary to uniquely
identify the value may be printed, rounding the last additional digit.
New in version 1.21.0.
Returns:
rep – The string representation of the floating point value
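A few calls adapted from the NumPy documentation for this function, showing the unique and precision behavior:

```python
import numpy as np

# unique=True (default): shortest string that round-trips the value
shortest = np.format_float_positional(np.float16(np.pi))   # '3.14'
unique03 = np.format_float_positional(np.float16(0.3))     # '0.3'
# unique=False: print the stored value to the requested precision
full03 = np.format_float_positional(np.float16(0.3), unique=False, precision=10)
# full03: '0.3000488281'
```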
Format a floating-point scalar as a decimal string in scientific notation.
Provides control over rounding, trimming and padding. Uses and assumes
IEEE unbiased rounding. Uses the “Dragon4” algorithm.
Parameters:
x (python float or numpy floating scalar) – Value to format.
precision (non-negative integer or None, optional) – Maximum number of digits to print. May be None if unique is
True, but must be an integer if unique is False.
unique (boolean, optional) – If True, use a digit-generation strategy which gives the shortest
representation which uniquely identifies the floating-point number from
other values of the same type, by judicious rounding. If precision
is given fewer digits than necessary can be printed, or if min_digits
is given more can be printed, in which cases the last digit is rounded
with unbiased rounding.
If False, digits are generated as if printing an infinite-precision
value and stopping after precision digits, rounding the remaining
value with unbiased rounding
trim (one of 'k', '.', '0', '-', optional) –
Controls post-processing trimming of trailing digits, as follows:
’k’ : keep trailing zeros, keep decimal point (no trimming)
’.’ : trim all trailing zeros, leave decimal point
’0’ : trim all but the zero before the decimal point. Insert the
zero if it is missing.
’-’ : trim trailing zeros and any trailing decimal point
sign (boolean, optional) – Whether to show the sign for positive values.
pad_left (non-negative integer, optional) – Pad the left side of the string with whitespace until at least that
many characters are to the left of the decimal point.
exp_digits (non-negative integer, optional) – Pad the exponent with zeros until it contains at least this many digits.
If omitted, the exponent will be at least 2 digits.
min_digits (non-negative integer or None, optional) –
Minimum number of digits to print. This only has an effect for
unique=True. In that case more digits than necessary to uniquely
identify the value may be printed and rounded unbiased.
New in version 1.21.0.
Returns:
rep – The string representation of the floating point value
formats (str or list of str) – The format description, either specified as a string with
comma-separated format descriptions in the form 'f8,i4,a5', or
a list of format description strings in the form
['f8','i4','a5'].
names (str or list/tuple of str) – The field names, either specified as a comma-separated string in the
form 'col1,col2,col3', or as a list or tuple of strings in the
form ['col1','col2','col3'].
An empty list can be used, in that case default field names
(‘f0’, ‘f1’, …) are used.
titles (sequence) – Sequence of title strings. An empty list can be used to leave titles
out.
aligned (bool, optional) – If True, align the fields by padding as the C-compiler would.
Default is False.
byteorder (str, optional) – If specified, all the fields will be changed to the
provided byte-order. Otherwise, the default byte-order is
used. For all available string specifiers, see dtype.newbyteorder.
names and/or titles can be empty lists. If titles is an empty list,
titles will simply not appear. If names is empty, default field names
will be used.
Converts a Pandas Series to an Arkouda pdarray or Strings object. If
dtype is None, the dtype is inferred from the Pandas Series. Otherwise,
the dtype parameter is set if the dtype of the Pandas Series is to be
overridden or is unknown (for example, in situations where the Series
dtype is object).
Parameters:
series (Pandas Series) – The Pandas Series with a dtype of bool, float64, int64, or string
dtype (Optional[type]) – The valid dtype types are np.bool, np.float64, np.int64, and np.str
The supported datatypes are bool, float64, int64, string, and datetime64[ns]. The
data type is either inferred from the Series or is set via the dtype parameter.
Series of datetime or timedelta are converted to Arkouda arrays of dtype int64 (nanoseconds)
A Pandas Series containing strings has a dtype of object. Arkouda assumes the Series
contains strings and sets the dtype to str
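The int64-nanoseconds conversion described above can be illustrated on the pandas side (the Arkouda call itself requires a running server, so this sketch stops at pandas):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["1970-01-01 00:00:01"]))
# s.dtype is datetime64[ns]; the underlying values are int64 nanoseconds
ns = s.astype("int64")
# one second after the epoch -> 1_000_000_000 ns
```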
A convenience method for instantiating an ArkoudaLogger that retrieves the
logging level from the ARKOUDA_LOG_LEVEL env variable
Parameters:
name (str) – The name of the ArkoudaLogger
handlers (List[Handler]) – A list of logging.Handler objects, if None, a list consisting of
one StreamHandler named ‘console-handler’ is generated and configured
logFormat (str) – The format for log messages, defaults to the following format:
‘[%(name)s] Line %(lineno)d %(levelname)s: %(message)s’
Return type:
ArkoudaLogger
Raises:
TypeError – Raised if either name or logFormat is not a str object or if handlers
is not a list of str objects
Notes
Important note: if a list of 1..n logging.Handler objects is passed in, and
dynamic changes to 1..n handlers is desired, set a name for each Handler
object as follows: handler.name = <desired name>, which will enable retrieval
and updates for the specified handler.
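The handler-naming note above can be sketched with the standard logging module; the logger and handler names here are hypothetical:

```python
import logging

logger = logging.getLogger("arkouda-example")
handler = logging.StreamHandler()
handler.name = "console-handler"   # name it so it can be retrieved later
logger.addHandler(handler)

# Later: look the handler up by name and adjust it dynamically
target = next(h for h in logger.handlers if h.name == "console-handler")
target.setLevel(logging.DEBUG)
```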
Get the names of the datasets in the provided files
Parameters:
filenames (str or List[str]) – Name of the file/s from which to return datasets
allow_errors (bool) – Default: False
Whether or not to allow errors while accessing datasets
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
Only used for Parquet Files.
Return type:
List[str] of names of the datasets
Raises:
RuntimeError –
If no datasets are returned
Notes
This function currently supports HDF5 and Parquet formats.
Future updates to Parquet will deprecate this functionality on that format,
but similar support will be added for Parquet at that time.
- If a list of files is provided, only the datasets in the first file will be returned
Get null indices of a string column in a Parquet file.
Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read. Each dataset must be a string
column. There is no default value for this function, the datasets to be
read must be specified.
Returns:
Dictionary of {datasetName: pdarray}
Return type:
returns a dictionary of Arkouda pdarrays
Raises:
RuntimeError – Raised if one or more of the specified files cannot be opened.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
full (bool) – This is only used when a single pdarray is passed into hash
By default, a 128-bit hash is computed and returned as
two int64 arrays. If full=False, then a 64-bit hash
is computed and returned as a single int64 array.
Returns:
If full=True or a list of pdarrays is passed,
a 2-tuple of pdarrays containing the high
and low 64 bits of each hash, respectively.
If full=False and a single pdarray is passed,
a single pdarray containing a 64-bit hash
Return type:
hashes
Raises:
TypeError – Raised if the parameter is not a pdarray
Notes
In the case of a single pdarray being passed, this function
uses the SIPhash algorithm, which can output either a 64-bit
or 128-bit hash. However, the 64-bit hash runs a significant
risk of collisions when applied to more than a few million
unique values. Unless the number of unique values is known to
be small, the 128-bit hash is strongly recommended.
Note that this hash should not be used for security, or for
any cryptographic application. Not only is SIPhash not
intended for such uses, but this implementation employs a
fixed key for the hash, which makes it possible for an
adversary with control over input to engineer collisions.
In the case of a list of pdarrays, Strings, Categoricals, or SegArrays
being passed, a non-linear function must be applied to each
array, since the hashes of subsequent arrays cannot simply be XORed:
equivalent values would cancel each other out. Hence we rotate
each hash by the ordinal of the array.
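A toy sketch of why plain XOR fails and how rotating by the array's ordinal helps; rot64 and row_hash are illustrative stand-ins, not Arkouda's implementation:

```python
MASK = (1 << 64) - 1

def rot64(v, n):
    # rotate a 64-bit value left by n bits
    n %= 64
    v &= MASK
    return ((v << n) | (v >> (64 - n))) & MASK

def row_hash(values):
    # XOR of per-array hashes, each rotated by the array's ordinal
    out = 0
    for i, v in enumerate(values):
        out ^= rot64(hash(v) & MASK, i)
    return out

# Plain XOR cancels equal values: ("x","x") and ("y","y") both give 0.
# With rotation, the two rows (almost surely) hash differently.
```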
Compute the bi-dimensional histogram of two data samples with evenly spaced bins
Parameters:
x (pdarray) – A pdarray containing the x coordinates of the points to be histogrammed.
y (pdarray) – A pdarray containing the y coordinates of the points to be histogrammed.
bins (int_scalars or [int, int] = 10) – The number of equal-size bins to use.
If int, the number of bins for the two dimensions (nx=ny=bins).
If [int, int], the number of bins in each dimension (nx, ny = bins).
Defaults to 10
Returns:
hist (pdarray) – shape(nx, ny)
The bi-dimensional histogram of samples x and y.
Values in x are histogrammed along the first dimension and
values in y are histogrammed along the second dimension.
x_edges (pdarray) – The bin edges along the first dimension.
y_edges (pdarray) – The bin edges along the second dimension.
Raises:
TypeError – Raised if x or y parameters are not pdarrays or if bins is
not an int or (int, int).
ValueError – Raised if bins < 1
NotImplementedError – Raised if pdarray dtype is bool or uint8
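The return structure can be sketched with NumPy's histogram2d, which this function's signature parallels:

```python
import numpy as np

x = np.array([0.1, 0.4, 0.8, 0.9])
y = np.array([0.2, 0.6, 0.3, 0.9])
hist, x_edges, y_edges = np.histogram2d(x, y, bins=2)
# hist has shape (2, 2): x is binned along axis 0, y along axis 1,
# and every point lands in exactly one bin, so the counts sum to 4
```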
Compute the multidimensional histogram of data in sample with evenly spaced bins.
Parameters:
sample (Sequence[pdarray]) – A sequence of pdarrays containing the coordinates of the points to be histogrammed.
bins (int_scalars or Sequence[int_scalars] = 10) – The number of equal-size bins to use.
If int, the number of bins for all dimensions (nx=ny=…=bins).
If [int, int, …], the number of bins in each dimension (nx, ny, … = bins).
Defaults to 10
Returns:
hist (pdarray) – shape(nx, ny, …, nd)
The multidimensional histogram of pdarrays in sample.
Values in first pdarray are histogrammed along the first dimension.
Values in second pdarray are histogrammed along the second dimension and so on.
edges (List[pdarray]) – A list of pdarrays containing the bin edges for each dimension.
Raises:
ValueError – Raised if bins < 1
NotImplementedError – Raised if pdarray dtype is bool or uint8
Test whether each element of a 1-D array is also present in a second array.
Returns a boolean array the same length as pda1 that is True
where an element of pda1 is in pda2 and False otherwise.
Support multi-level – test membership of rows of a in the set of rows of b.
Parameters:
a (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements for which to test membership in b
b (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements of the set in which to test membership
assume_unique (bool) – If true, assume rows of a and b are each unique and sorted.
By default, sort and unique them explicitly.
symmetric (bool) – Return in1d(pda1, pda2), in1d(pda2, pda1) when pda1 and pda2 are single items.
invert (bool, optional) – If True, the values in the returned array are inverted (that is,
False where an element of pda1 is in pda2 and True otherwise).
Default is False. ak.in1d(a,b,invert=True) is equivalent
to (but is faster than) ~ak.in1d(a,b).
Returns:
True for each row in a that is contained in b
Return type:
pdarray, bool
Notes
Only works for pdarrays of int64 dtype, float64, Strings, or Categorical
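The membership test and its invert flag can be sketched with NumPy (np.isin is the modern spelling of np.in1d), assuming the Arkouda semantics match:

```python
import numpy as np

a = np.array([1, 2, 3, 2])
b = np.array([2, 4])
member = np.isin(a, b)                 # [False  True False  True]
inverted = np.isin(a, b, invert=True)  # equivalent to ~member, but faster
```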
Return indices of query items in a search list of items. Items not found will be excluded.
When duplicate terms are present in search space return indices of all occurrences.
Parameters:
query ((sequence of) pdarray or Strings or Categorical) – The items to search for. If multiple arrays, each “row” is an item.
space ((sequence of) pdarray or Strings or Categorical) – The set of items in which to search. Must have same shape/dtype as query.
Returns:
indices – For each item in query, its index in space.
This is an alias of
ak.find(query, space, all_occurrences=True, remove_missing=True).values
Examples
>>> select_from = ak.arange(10)
>>> arr1 = select_from[ak.randint(0, select_from.size, 20, seed=10)]
>>> arr2 = select_from[ak.randint(0, select_from.size, 20, seed=11)]
>>> # remove some values to ensure we have some values
>>> # which don't appear in the search space
>>> arr2 = arr2[arr2 != 9]
>>> arr2 = arr2[arr2 != 3]
Returns JSON formatted string containing information about the objects in names
Parameters:
names (Union[List[str], str]) – names is either the name of an object or list of names of objects to retrieve info
if names is ak.AllSymbols, retrieves info for all symbols in the symbol table
if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry
Returns:
JSON formatted string containing a list of information for each object in names
Return type:
str
Raises:
RuntimeError – Raised if a server-side error is thrown in the process of
retrieving information about the objects in names
positions (bool, default=True) – Return tuple of boolean pdarrays that indicate positions in a and b
of the intersection values.
unique (bool, default=False) – If the number of distinct values in a (and b) is equal to the size of
a (and b), there is a more efficient method to compute the intersection.
Returns:
The indices of a and b where any element occurs at least once in both
arrays.
This helper is intended to help future-proof changes made to
accommodate IPv6 and to prevent errors if a user inadvertently
casts an IPv4 instead of an int64 pdarray. It can also be used
for importing Python lists of IP addresses into Arkouda.
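The underlying idea (an IPv4 address is a 32-bit integer, so it fits in int64) can be shown with the standard ipaddress module:

```python
import ipaddress

as_int = int(ipaddress.IPv4Address("192.168.1.1"))   # 3232235777
back = str(ipaddress.IPv4Address(as_int))            # round-trips
```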
Returns True if the type of element is a scalar type.
Parameters:
element (any) – Input argument, can be of any type and shape.
Returns:
val – True if element is a scalar type, False if it is not.
Return type:
bool
See also
ndim
Get the number of dimensions of an array
Notes
If you need a stricter way to identify a numerical scalar, use
isinstance(x,numbers.Number), as that returns False for most
non-numerical elements such as strings.
In most cases np.ndim(x)==0 should be used instead of this function,
as that will also return true for 0d arrays. This is how numpy overloads
functions in the style of the dx arguments to gradient and the bins
argument to histogram. Some key differences:
x: PEP 3141 numeric objects (including builtins) – isscalar(x): True; np.ndim(x)==0: True
x: builtin string and buffer objects – isscalar(x): True; np.ndim(x)==0: True
x: other builtin objects, like pathlib.Path, Exception, the result of re.compile – isscalar(x): False; np.ndim(x)==0: True
Determine if a class is a subclass of a second class.
issubclass_ is equivalent to the Python built-in issubclass,
except that it returns False instead of raising a TypeError if one
of the arguments is not a class.
Parameters:
arg1 (class) – Input class. True is returned if arg1 is a subclass of arg2.
arg2 (class or tuple of classes.) – Input class. If a tuple of classes, True is returned if arg1 is a
subclass of any of the tuple elements.
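The documented behavior amounts to this minimal pure-Python equivalent (a sketch, not NumPy's source):

```python
def issubclass_(arg1, arg2):
    # Like the builtin issubclass, but returns False instead of
    # raising TypeError when an argument is not a class.
    try:
        return issubclass(arg1, arg2)
    except TypeError:
        return False

issubclass_(int, object)   # True
issubclass_(1, int)        # False, where issubclass(1, int) would raise
```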
Load a pdarray previously saved with pdarray.save().
Parameters:
path_prefix (str) – Filename prefix used to save the original pdarray
file_format (str) – ‘INFER’, ‘HDF5’ or ‘Parquet’. Defaults to ‘INFER’. Used to indicate the file type being loaded.
If INFER, this will be detected during processing
dataset (str) – Dataset name where the pdarray was saved, defaults to ‘array’
calc_string_offsets (bool) – If True the server will ignore Segmented Strings ‘offsets’ array and derive
it from the null-byte terminators. Defaults to False currently
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
Returns:
Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]}
with the previously saved pdarrays, Strings, SegArrays, or Categoricals
TypeError – Raised if either path_prefix or dataset is not a str
ValueError – Raised if invalid file_format or if the dataset is not present in all hdf5 files or if the
path_prefix does not correspond to files accessible to Arkouda
RuntimeError – Raised if the hdf5 files are present but there is an error in opening
one or more of them
If you have a previously saved Parquet file that is raising a FileNotFound error, try loading it
with a .parquet appended to the prefix_path.
Parquet files were previously ALWAYS stored with a .parquet extension.
ak.load does not support loading a single file.
For loading single HDF5 files without the _LOCALE#### suffix please use ak.read().
CSV files without the Arkouda Header are not supported.
Examples
>>> # Loading from file without extension
>>> obj = ak.load('path/prefix')
Loads the array from numLocales files with the name cwd/path/name_prefix_LOCALE####. The file type is inferred during processing.
>>> # Loading with an extension (HDF5)
>>> obj = ak.load('path/prefix.test')
Loads the object from numLocales files with the name cwd/path/name_prefix_LOCALE####.test, where #### is replaced by each locale number. Because the file type is inferred during processing, the extension is not required to be a specific format.
Load multiple pdarrays, Strings, SegArrays, or Categoricals previously
saved with save_all().
Parameters:
path_prefix (str) – Filename prefix used to save the original pdarray
file_format (str) – ‘INFER’, ‘HDF5’, ‘Parquet’, or ‘CSV’. Defaults to ‘INFER’. Indicates the format being loaded.
When ‘INFER’ the processing will detect the format
Defaults to ‘INFER’
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
Parquet files only
Returns:
Dictionary of {datasetName: Union[pdarray, Strings, SegArray, Categorical]}
with the previously saved pdarrays, Strings, SegArrays, or Categoricals
ValueError – Raised if file_format/extension is encountered that is not hdf5 or parquet or
if all datasets are not present in all hdf5/parquet files or if the
path_prefix does not correspond to files accessible to Arkouda
RuntimeError – Raised if the hdf5 files are present but there is an error in opening
one or more of them
Return a pair of integers, whose ratio is exactly equal to the original
floating point number, and with a positive denominator.
Raise OverflowError on infinities and a ValueError on NaNs.
This function calls the h5ls utility on an HDF5 file visible to the
arkouda server or calls a function that imitates the result of h5ls
on a Parquet file.
Parameters:
filename (str) – The name of the file to pass to the server
col_delim (str) – The delimiter used to separate columns if the file is a csv
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
Only used for Parquet files.
Returns:
The string output of the datasets from the server
Return type:
str
Raises:
TypeError – Raised if filename is not a str
ValueError – Raised if filename is empty or contains only whitespace
RuntimeError – Raised if error occurs in executing ls on an HDF5 file
Notes
This will need to be updated, because Parquet will not technically support this
operation after a planned update. Similar functionality will be added for Parquet
in the future.
TypeError – Raised if pda is not a pdarray or k is not an integer
ValueError – Raised if the pda is empty or k < 1
Notes
This call is equivalent in value to:
a[ak.argsort(a)[-k:]]
and generally outperforms this operation.
This reduction will see a significant drop in performance as k grows
beyond a certain value. This value is system dependent, but generally
about a k of 5 million is where performance degradation has been observed.
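The argsort identity above can be checked with a small pure-Python sketch (illustrative only; ak.maxk itself computes this server-side on a pdarray):

```python
# Sketch: the top-k values of an array equal the last k elements of its
# ascending argsort, mirroring the a[ak.argsort(a)[-k:]] identity above.

def argsort(a):
    """Indices that would sort `a` ascending (stable, like ak.argsort)."""
    return sorted(range(len(a)), key=a.__getitem__)

def maxk_via_argsort(a, k):
    """Top-k values (in ascending order) via the argsort identity."""
    if not a or k < 1:
        raise ValueError("a must be non-empty and k >= 1")
    return [a[i] for i in argsort(a)[-k:]]

values = [7, 1, 9, 4, 9, 2]
print(maxk_via_argsort(values, 3))  # -> [7, 9, 9]
```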
Merge Arkouda DataFrames with a database-style join.
The resulting dataframe contains rows from both DataFrames as specified by
the merge condition (based on the “how” and “on” parameters).
left (DataFrame) – The Left DataFrame to be joined.
right (DataFrame) – The Right DataFrame to be joined.
on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on.
If on is None, this defaults to the intersection of the columns in both DataFrames.
how (str, default = "inner") – The merge condition.
Must be one of “inner”, “left”, “right”, or “outer”.
left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping
column names in both left and right. Defaults to “_x”. Only used when how is “inner”.
right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping
column names in both left and right. Defaults to “_y”. Only used when how is “inner”.
convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64.
This is to match pandas.
If False, do not convert the column dtypes.
This has no effect when how = “inner”.
sort (bool = True) – If True, DataFrame is returned sorted by “on”.
Otherwise, the DataFrame is not sorted.
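The "how" semantics can be illustrated with a hypothetical pure-Python sketch over row dicts (not Arkouda's distributed implementation, and `simple_merge` is not part of the Arkouda API):

```python
# Sketch of database-style join semantics on plain Python dicts.
# Illustrative only; ak.merge operates on distributed Arkouda DataFrames.

def simple_merge(left, right, on, how="inner"):
    """Join two lists of row-dicts on a single key column."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[on], []).append(row)
    out = []
    for lrow in left:
        matches = right_by_key.get(lrow[on], [])
        if matches:
            for rrow in matches:
                out.append({**lrow, **rrow})
        elif how == "left":
            # Unmatched left rows survive with missing right-side values,
            # analogous to the int -> float64 conversion convert_ints describes.
            out.append({**lrow})
    return out

left = [{"id": 1, "x": 10}, {"id": 2, "x": 20}]
right = [{"id": 2, "y": 200}, {"id": 3, "y": 300}]
print(simple_merge(left, right, on="id"))              # inner: only id=2 survives
print(simple_merge(left, right, on="id", how="left"))  # left: id=1 kept too
```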
This reduction will see a significant drop in performance as k grows
beyond a certain value. This value is system dependent, but generally
about a k of 5 million is where performance degradation has been observed.
This represents a generic version of type ‘origin’ with type arguments ‘params’.
There are two kinds of these aliases: user defined and special. The special ones
are wrappers around builtin collections and ABCs in collections.abc. These must
have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated;
this is used by e.g. typing.List and typing.Dict.
The basic arkouda array class. This class contains only the
attributes of the array; the data resides on the arkouda
server. When a server operation results in a new array, arkouda
will create a pdarray instance that points to the array data on
the server. As such, the user should not initialize pdarray
instances directly.
Registered names/pdarrays in the server are immune to deletion
until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
Creates a list of uint pdarrays from a bigint pdarray.
The first item returned will be the highest 64 bits of the
bigint pdarray and the last item will be the lowest 64 bits.
Returns:
A list of uint pdarrays where:
the first item is the highest 64 bits of the
bigint pdarray and the last item is the lowest 64 bits.
Return type:
List[pdarrays]
Raises:
RuntimeError – Raised if there is a server-side error thrown
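The high-to-low 64-bit decomposition can be sketched in pure Python (illustrative only; the real method returns uint64 pdarrays computed server-side, one per element position):

```python
# Sketch: split an arbitrary-precision integer into 64-bit limbs,
# most significant limb first, mirroring bigint_to_uint_arrays per element.

def bigint_to_uint_limbs(value):
    """Return the 64-bit limbs of a non-negative int, highest limb first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    limbs = []
    while True:
        limbs.append(value & 0xFFFFFFFFFFFFFFFF)  # lowest 64 bits
        value >>= 64
        if value == 0:
            break
    return limbs[::-1]  # highest 64 bits first, as the docstring describes

n = (1 << 64) + 5  # needs two limbs: high part 1, low part 5
print(bigint_to_uint_limbs(n))  # -> [1, 5]
```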
Attempt to cast scalar other to the element dtype of this pdarray,
and print the resulting value to a string (e.g. for sending to a
server command). The user should not call this function directly.
Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
Return type:
string representation of np.dtype corresponding to the other parameter
Raises:
TypeError – Raised if the other parameter cannot be converted to
Numpy dtype
Register this pdarray with a user defined name in the arkouda server
so it can be attached to later using pdarray.attach()
This is an in-place operation, registering a pdarray more than once will
update the name in the registry and remove the previously registered name.
A name can only be registered to one pdarray at a time.
Parameters:
user_defined_name (str) – user defined name array is to be registered under
Returns:
The same pdarray which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different pdarrays with the same name.
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name
If the user is attempting to register more than one pdarray with the same name,
the former should be unregistered first to free up the registration name.
Registered names/pdarrays in the server are immune to deletion
until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
DEPRECATED
Save the pdarray to HDF5 or Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. HDF5 supports single files, in which case the file name will
only be that provided. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If
‘Parquet’, the files will be written to the Parquet file format. This
is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to
file write location or if the mode parameter is neither truncate
nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters
is not a string
The prefix_path must be visible to the arkouda server and the user must
have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales. If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
Previously all files saved in Parquet format were saved with a .parquet file extension.
This will require you to use load as if you saved the file with the extension. Try this if
an older file is not being found.
Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
Write pdarray to CSV file(s). File will contain a single column with the pdarray data.
All CSV Files written by Arkouda include a header denoting data types of the columns.
prefix_path: str
The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
dataset: str
Column name to save the pdarray under. Defaults to “array”.
col_delim: str
Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite: bool
Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
str response message
ValueError
Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError
Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError
Raised if we receive an unknown arkouda_type returned from the server
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
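A hypothetical client-side guard for the delimiter caveat above (the helper name is illustrative and not part of the Arkouda API):

```python
# Sketch: verify a chosen CSV column delimiter does not occur in any value,
# per the caution above. Hypothetical helper, not part of Arkouda.

def check_col_delim(values, col_delim=","):
    """Raise ValueError if the delimiter appears in any stringified value."""
    for v in values:
        if col_delim in str(v):
            raise ValueError(
                f"delimiter {col_delim!r} appears in value {v!r}; "
                "choose a different col_delim"
            )
    return True

print(check_col_delim(["alpha", "beta"], col_delim=","))  # -> True
```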
Convert the array to a Numba DeviceND array, transferring array data from the
arkouda server to Python via ndarray. If the array exceeds a builtin size limit,
a RuntimeError is raised.
Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
Return type:
numba.DeviceNDArray
Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving
the pdarray.
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Save the pdarray to HDF5.
The object can be saved to a collection of files or single file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When distribute, dataset is written on a file per locale.
This is only supported by HDF5 files and will have no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array to a single HDF5 file on the root node,
``cwd/path/name_prefix.hdf5``
Convert the array to a list, transferring array data from the
Arkouda server to client-side Python. Note: if the pdarray size exceeds
client.maxTransferBytes, a RuntimeError is raised.
Returns:
A list with the same data as the pdarray
Return type:
list
Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size
exceeds the built-in client.maxTransferBytes size limit, or if the bytes
received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Convert the array to a np.ndarray, transferring array data from the
Arkouda server to client-side Python. Note: if the pdarray size exceeds
client.maxTransferBytes, a RuntimeError is raised.
Returns:
A numpy ndarray with the same attributes and data as the pdarray
Return type:
np.ndarray
Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size
exceeds the built-in client.maxTransferBytes size limit, or if the bytes
received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Save the pdarray to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (Parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
hostname (str) – The hostname where the Arkouda server intended to
receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports in
succession, using ports in the range
{port..(port+numLocales)} (e.g., when running an
Arkouda server on 4 nodes with port 1234 passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Registered names/pdarrays in the server are immune to deletion until
they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
Overwrite the dataset with the name provided with this pdarray. If
the dataset does not exist it is added
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If file does not contain File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used.The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a=ak.arange(25)>>> # Saving without an extension>>> a.to_parquet('path/prefix',dataset='array')Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``>>> # Saving with an extension (HDF5)>>> a.to_parqet('path/prefix.parquet',dataset='array')Saves the array to numLocales HDF5 files with the name``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
hostname (str) – The hostname where the Arkouda server intended to
receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open up numLocales ports,
each of which in succession, so will use ports of the
range {port..(port+numLocales)} (e.g., running an
Arkouda server of 4 nodes, port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port much match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Registered names/pdarrays in the server are immune to deletion until
they are unregistered.
Examples
>>> a=zeros(100)>>> a.register("my_zeros")>>> # potentially disconnect from server and reconnect to server>>> b=ak.pdarray.attach("my_zeros")>>> # ...other work...>>> b.unregister()
Overwrite the dataset with the name provided with this pdarray. If
the dataset does not exist it is added
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If file does not contain File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the dataset provided does not exist, it will be added
The basic arkouda array class. This class contains only the
attributes of the array; the data resides on the arkouda
server. When a server operation results in a new array, arkouda
creates a pdarray instance that points to the array data on
the server. As such, the user should not initialize pdarray
instances directly.
Registered names/pdarrays in the server are immune to deletion
until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
Create a list of uint pdarrays from a bigint pdarray.
The first item in the returned list holds the highest 64 bits of the
bigint pdarray and the last item holds the lowest 64 bits.
Returns:
A list of uint pdarrays where the first item holds the highest 64 bits
of the bigint pdarray and the last item holds the lowest 64 bits.
Return type:
List[pdarray]
Raises:
RuntimeError – Raised if there is a server-side error thrown
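The limb ordering described above can be illustrated with plain Python integers. The helper below is a hypothetical sketch, not part of the arkouda API:

```python
def to_uint64_limbs(value: int) -> list[int]:
    """Split a nonnegative integer into 64-bit limbs, highest 64 bits first."""
    if value < 0:
        raise ValueError("expected a nonnegative integer")
    limbs = []
    while True:
        limbs.append(value & 0xFFFF_FFFF_FFFF_FFFF)  # take the lowest 64 bits
        value >>= 64
        if value == 0:
            break
    return limbs[::-1]  # reverse so the highest 64 bits come first
```

For example, `to_uint64_limbs((1 << 64) + 5)` yields `[1, 5]`: the first entry holds the high 64 bits and the last entry holds the low 64 bits, matching the documented ordering.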
Attempt to cast the scalar other to the element dtype of this pdarray,
and format the resulting value as a string (e.g. for sending to a
server command). The user should not call this function directly.
Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
Return type:
string representation of np.dtype corresponding to the other parameter
Raises:
TypeError – Raised if the other parameter cannot be converted to
Numpy dtype
Register this pdarray with a user-defined name in the arkouda server
so it can be attached to later using pdarray.attach().
This is an in-place operation; registering a pdarray more than once will
update the name in the registry and remove the previously registered name.
A name can only be registered to one pdarray at a time.
Parameters:
user_defined_name (str) – user-defined name the array is to be registered under
Returns:
The same pdarray which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different pdarrays with the same name.
Raises:
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name
If the user is attempting to register more than one pdarray with the same name,
the former should be unregistered first to free up the registration name.
Registered names/pdarrays in the server are immune to deletion
until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
DEPRECATED
Save the pdarray to HDF5 or Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. HDF5 supports single files, in which case the file name
will be exactly the one provided. Each locale saves its chunk of the array
to its corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If
‘Parquet’, the files will be written to the Parquet file format. This
is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When set to distribute, dataset is written to one file per locale.
This is only supported by HDF5 files and has no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to
file write location or if the mode parameter is neither truncate
nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters
is not a string
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales. If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
Previously, all files saved in Parquet format were given a .parquet file extension.
As a result, loading such a file requires passing the name with the extension, as if
it had been saved that way. Try this if an older file is not being found.
Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
Write pdarray to CSV file(s). File will contain a single column with the pdarray data.
All CSV files written by Arkouda include a header denoting the data types of the columns.
Parameters:
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
dataset (str) – Column name to save the pdarray under. Defaults to “array”.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Returns:
str response message
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
- CSV format is not currently supported by load/load_all operations.
- The column delimiter is expected to be the same for column names and data.
- Be sure that column delimiters are not found within your data.
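Because the delimiter must never occur in the data, a client-side guard along these lines can catch the problem before writing. This is an illustrative check, not arkouda code:

```python
def check_col_delim(values: list[str], col_delim: str = ",") -> None:
    """Raise if the column delimiter appears anywhere in the data to be written."""
    for v in values:
        if col_delim in v:
            raise ValueError(
                f"column delimiter {col_delim!r} appears in value {v!r}; "
                "the resulting CSV could not be parsed back unambiguously"
            )
```

Running such a check before calling to_csv turns a silently corrupted file into an immediate, explainable error.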
Convert the array to a Numba DeviceNDArray, transferring array data from the
arkouda server to Python via ndarray. If the array exceeds a built-in size limit,
a RuntimeError is raised.
Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
Return type:
numba.DeviceNDArray
Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving
the pdarray.
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Save the pdarray to HDF5.
The object can be saved to a collection of files or a single file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to single, dataset is written to a single file.
When set to distribute, dataset is written to one file per locale.
This is only supported by HDF5 files and has no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array to a single HDF5 file on the root node,
``cwd/path/name_prefix.hdf5``
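The per-locale naming scheme can be sketched as follows. The helper is hypothetical, and the assumption that the #### placeholder is a four-digit zero-padded locale number is mine, not stated by the docs:

```python
def output_file_names(prefix_path: str, num_locales: int, extension: str = "") -> list[str]:
    """Expected output file names for file_type='distribute'."""
    # One file per locale: <prefix_path>_LOCALE<i><extension>
    return [f"{prefix_path}_LOCALE{i:04d}{extension}" for i in range(num_locales)]
```

For example, `output_file_names('path/prefix', 2, '.h5')` returns `['path/prefix_LOCALE0000.h5', 'path/prefix_LOCALE0001.h5']` under that padding assumption.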
Convert the array to a list, transferring array data from the
Arkouda server to client-side Python. Note: if the pdarray size exceeds
client.maxTransferBytes, a RuntimeError is raised.
Returns:
A list with the same data as the pdarray
Return type:
list
Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size
exceeds the built-in client.maxTransferBytes size limit, or if the bytes
received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
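The size guard described in the notes amounts to a check like the following sketch; the names are illustrative, not the actual client internals:

```python
def check_transfer_size(num_elements: int, itemsize: int, max_transfer_bytes: int) -> None:
    """Raise RuntimeError if transferring the array would exceed the byte limit."""
    nbytes = num_elements * itemsize
    if nbytes > max_transfer_bytes:
        raise RuntimeError(
            f"array is {nbytes} bytes, exceeding the {max_transfer_bytes}-byte limit; "
            "raise client.maxTransferBytes to override, with caution"
        )
```

The check runs before any data leaves the server, so an oversized request fails fast instead of exhausting client memory mid-transfer.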
Convert the array to a np.ndarray, transferring array data from the
Arkouda server to client-side Python. Note: if the pdarray size exceeds
client.maxTransferBytes, a RuntimeError is raised.
Returns:
A numpy ndarray with the same attributes and data as the pdarray
Return type:
np.ndarray
Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size
exceeds the built-in client.maxTransferBytes size limit, or if the bytes
received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Save the pdarray to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (Parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
hostname (str) – The hostname where the Arkouda server intended to
receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). Arkouda will open numLocales ports in
succession, covering the range {port..(port+numLocales)}
(e.g., when an Arkouda server of 4 nodes is running and
port 1234 is passed as port, Arkouda will use ports 1234,
1235, 1236, and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
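The port usage described above can be sketched as a simple helper; this is hypothetical, as the real client manages these sockets internally:

```python
def transfer_ports(base_port: int, num_locales: int) -> list[int]:
    """Ports a transfer would use: one per locale, starting at base_port."""
    return list(range(base_port, base_port + num_locales))
```

With a 4-node server and base port 1234 this gives `[1234, 1235, 1236, 1237]`, matching the example in the parameter description.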
Registered names/pdarrays in the server are immune to deletion until
they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
Overwrite the dataset with the provided name using this pdarray. If
the dataset does not exist, it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to False will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If the file does not contain a File_Format attribute indicating how it was saved,
the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (Parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
hostname (str) – The hostname where the Arkouda server that is intended to
receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open numLocales ports in
succession, so ports in the range
{port..(port+numLocales)} will be used (e.g., running an
Arkouda server of 4 nodes with port 1234 passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
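The per-locale port usage described above can be illustrated with a small sketch; the helper is hypothetical (arkouda opens the ports internally), but the arithmetic matches the 4-node example:

```python
def transfer_ports(base_port: int, num_locales: int) -> list[int]:
    """One transfer port per locale, starting at base_port, as described above.
    Hypothetical helper; arkouda manages these ports itself."""
    return list(range(base_port, base_port + num_locales))

# A 4-locale server given port 1234 uses 1234, 1235, 1236, and 1237
ports = transfer_ports(1234, 4)
```

All of these ports must be open on the receiving side, and the base port must match the one passed to ak.receive_array().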
Registered names/pdarrays in the server are immune to deletion until
they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
Overwrite the dataset with the provided name using this pdarray. If
the dataset does not exist, it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If the file does not contain a File_Format attribute indicating how it was saved,
the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
The basic arkouda array class. This class contains only the
attributes of the array; the data resides on the arkouda
server. When a server operation results in a new array, arkouda
will create a pdarray instance that points to the array data on
the server. As such, the user should not initialize pdarray
instances directly.
Registered names/pdarrays in the server are immune to deletion
until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
Creates a list of uint pdarrays from a bigint pdarray.
The first item in return will be the highest 64 bits of the
bigint pdarray and the last item will be the lowest 64 bits.
Returns:
A list of uint pdarrays where:
The first item in return will be the highest 64 bits of the
bigint pdarray and the last item will be the lowest 64 bits.
Return type:
List[pdarrays]
Raises:
RuntimeError – Raised if there is a server-side error thrown
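The limb decomposition described above can be illustrated with plain Python integers. This is a client-side sketch of the ordering, not the server implementation:

```python
def bigint_to_uint_limbs(value: int) -> list[int]:
    """Split a non-negative integer into 64-bit limbs, highest limb first,
    mirroring the ordering described above. Illustrative sketch only."""
    if value == 0:
        return [0]
    limbs = []
    while value > 0:
        limbs.append(value & 0xFFFFFFFFFFFFFFFF)  # lowest 64 bits
        value >>= 64
    return limbs[::-1]  # reverse so the highest 64 bits come first

x = (1 << 64) + 5   # needs two limbs: high limb 1, low limb 5
limbs = bigint_to_uint_limbs(x)
```

The original value can be reconstructed by shifting each limb back into place, which is why the ordering (highest first) matters.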
Attempt to cast scalar other to the element dtype of this pdarray,
and print the resulting value to a string (e.g. for sending to a
server command). The user should not call this function directly.
Parameters:
other (object) – The scalar to be cast to the pdarray.dtype
Return type:
string representation of np.dtype corresponding to the other parameter
Raises:
TypeError – Raised if the other parameter cannot be converted to
Numpy dtype
Register this pdarray with a user defined name in the arkouda server
so it can be attached to later using pdarray.attach()
This is an in-place operation, registering a pdarray more than once will
update the name in the registry and remove the previously registered name.
A name can only be registered to one pdarray at a time.
Parameters:
user_defined_name (str) – user defined name array is to be registered under
Returns:
The same pdarray which is now registered with the arkouda server and has an updated name.
This is an in-place modification, the original is returned to support a
fluid programming style.
Please note you cannot register two different pdarrays with the same name.
TypeError – Raised if user_defined_name is not a str
RegistrationError – If the server was unable to register the pdarray with the user_defined_name
If the user is attempting to register more than one pdarray with the same name,
the former should be unregistered first to free up the registration name.
Registered names/pdarrays in the server are immune to deletion
until they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
DEPRECATED
Save the pdarray to HDF5 or Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. HDF5 supports single files, in which case the file name will
only be that provided. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If
‘Parquet’, the files will be written to the Parquet file format. This
is case insensitive.
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to ‘single’, the dataset is written to a single file.
When ‘distribute’, the dataset is written to one file per locale.
This is only supported by HDF5 files and has no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
ValueError – Raised if there is an error in parsing the prefix path pointing to
file write location or if the mode parameter is neither truncate
nor append
TypeError – Raised if any one of the prefix_path, dataset, or mode parameters
is not a string
The prefix_path must be visible to the arkouda server and the user must
have write permission.
Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales. If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
Previously, all files saved in Parquet format were saved with a .parquet file extension.
You will need to call load as if you saved the file with that extension. Try this if
an older file is not being found.
Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
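A sketch of how the per-locale output names in the examples above are formed. The helper is illustrative; the four-digit zero padding mirrors the #### placeholder and is an assumption about the exact width:

```python
def locale_filenames(prefix_path: str, num_locales: int) -> list[str]:
    """Names of the form <prefix_path>_LOCALE<i> for i in 0..num_locales-1.
    Illustrative helper; the 4-digit padding is assumed from the #### pattern."""
    return [f"{prefix_path}_LOCALE{i:04d}" for i in range(num_locales)]

# Two locales writing under 'path/prefix'
names = locale_filenames("path/prefix", 2)
```

Each locale writes its chunk to its own file, so the number of files equals numLocales in distributed mode.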
Write pdarray to CSV file(s). File will contain a single column with the pdarray data.
All CSV Files written by Arkouda include a header denoting data types of the columns.
prefix_path: str
The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
dataset: str
Column name to save the pdarray under. Defaults to “array”.
col_delim: str
Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite: bool
Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
str response message
ValueError
Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError
Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError
Raised if we receive an unknown arkouda_type returned from the server
CSV format is not currently supported by load/load_all operations
The column delimiter is expected to be the same for column names and data
Be sure that column delimiters are not found within your data.
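A minimal sketch of writing a single-column CSV with a leading type header, using Python's stdlib. The header layout here (name:dtype) is illustrative only and does not reproduce Arkouda's exact on-disk header format:

```python
import csv
import io

def write_csv_with_type_header(data: list[int], dataset: str = "array",
                               col_delim: str = ",") -> str:
    """Write a single-column CSV whose first line is a type header.
    The 'name:dtype' header is a hypothetical stand-in for Arkouda's header."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=col_delim)
    writer.writerow([f"{dataset}:int64"])  # hypothetical type-header line
    for v in data:
        writer.writerow([v])
    return buf.getvalue()

out = write_csv_with_type_header([1, 2, 3])
```

As the Notes warn, the delimiter must not appear inside the data itself, since the stdlib writer (like Arkouda's CSV path) splits on it when reading back.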
Convert the array to a Numba DeviceNDArray, transferring array data from the
arkouda server to Python via ndarray. If the array exceeds a built-in size limit,
a RuntimeError is raised.
Returns:
A Numba ndarray with the same attributes and data as the pdarray; on GPU
Return type:
numba.DeviceNDArray
Raises:
ImportError – Raised if CUDA is not available
ModuleNotFoundError – Raised if Numba is either not installed or not enabled
RuntimeError – Raised if there is a server-side error thrown in the course of retrieving
the pdarray.
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Save the pdarray to HDF5.
The object can be saved to a collection of files or single file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
file_type (str ("single" | "distribute")) – Default: “distribute”
When set to ‘single’, the dataset is written to a single file.
When ‘distribute’, the dataset is written to one file per locale.
This is only supported by HDF5 files and has no impact on Parquet files.
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’. Otherwise,
the file name will be prefix_path.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array to a single HDF5 file on the root node,
``cwd/path/name_prefix.hdf5``
Convert the array to a list, transferring array data from the
Arkouda server to client-side Python. Note: if the pdarray size exceeds
client.maxTransferBytes, a RuntimeError is raised.
Returns:
A list with the same data as the pdarray
Return type:
list
Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size
exceeds the built-in client.maxTransferBytes size limit, or if the bytes
received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Convert the array to a np.ndarray, transferring array data from the
Arkouda server to client-side Python. Note: if the pdarray size exceeds
client.maxTransferBytes, a RuntimeError is raised.
Returns:
A numpy ndarray with the same attributes and data as the pdarray
Return type:
np.ndarray
Raises:
RuntimeError – Raised if there is a server-side error thrown, if the pdarray size
exceeds the built-in client.maxTransferBytes size limit, or if the bytes
received does not match expected number of bytes
Notes
The number of bytes in the array cannot exceed client.maxTransferBytes,
otherwise a RuntimeError will be raised. This is to protect the user
from overflowing the memory of the system on which the Python client
is running, under the assumption that the server is running on a
distributed system with much more memory than the client. The user
may override this limit by setting client.maxTransferBytes to a larger
value, but proceed with caution.
Save the pdarray to Parquet. The result is a collection of files,
one file per locale of the arkouda server, where each filename starts
with prefix_path. Each locale saves its chunk of the array to its
corresponding file.
:param prefix_path: Directory and filename prefix that all output files share
:type prefix_path: str
:param dataset: Name of the dataset to create in files (must not already exist)
:type dataset: str
:param mode: By default, truncate (overwrite) output files, if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”)
Sets the compression type used with Parquet files
Return type:
string message indicating result of save operation
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
The prefix_path must be visible to the arkouda server and the user must
have write permission.
- Output files have names of the form <prefix_path>_LOCALE<i>, where <i>
ranges from 0 to numLocales for file_type=’distribute’.
- ‘append’ write mode is supported, but is not efficient.
- If any of the output files already exist and
the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’
and the number of output files is less than the number of locales or a
dataset with the same name already exists, a RuntimeError will result.
- Any file extension can be used. The file I/O does not rely on the extension to
determine the file format.
Examples
>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales Parquet files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (Parquet)
>>> a.to_parquet('path/prefix.parquet', dataset='array')
Saves the array to numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
hostname (str) – The hostname where the Arkouda server that is intended to
receive the pdarray is running.
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open numLocales ports in
succession, so ports in the range
{port..(port+numLocales)} will be used (e.g., running an
Arkouda server of 4 nodes with port 1234 passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
ak.receive_array().
Return type:
A message indicating a complete transfer
Raises:
ValueError – Raised if the op is not within the pdarray.BinOps set
TypeError – Raised if other is not a pdarray or the pdarray.dtype is not
a supported dtype
Registered names/pdarrays in the server are immune to deletion until
they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
Overwrite the dataset with the provided name using this pdarray. If
the dataset does not exist, it is added.
Parameters:
prefix_path (str) – Directory and filename prefix that all output files share
dataset (str) – Name of the dataset to create in files
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Return type:
str - success message if successful
Raises:
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Notes
If the file does not contain a File_Format attribute indicating how it was saved,
the file name is checked for _LOCALE#### to determine whether it is distributed.
If the dataset provided does not exist, it will be added.
Plot the distribution and cumulative distribution of histogram data
Parameters:
b (np.ndarray) – Bin edges
h (np.ndarray) – Histogram data
log (bool) – use log to scale y
xlabel (str) – Label for the x axis of the graph
newfig (bool) – Generate a new figure or not
Notes
This function does not return or display the plot. A user must have matplotlib imported in
addition to arkouda to display plots. This could be updated to return the object or have a
flag to show the resulting plots.
See Examples Below.
Examples
>>> import arkouda as ak
>>> from matplotlib import pyplot as plt
>>> b, h = ak.histogram(ak.arange(10), 3)
>>> ak.plot_dist(b, h.to_ndarray())
>>> # to show the plot
>>> plt.show()
Raises an array to a power. If where is given, the operation will only take place in the positions
where the where condition is True.
Note:
Our implementation of the where argument deviates from numpy. The difference in behavior occurs
at positions where the where argument contains a False. In numpy, these positions will have
uninitialized memory (which can contain anything and will vary between runs). We have chosen
instead to return the value of the original array in these positions.
Parameters:
pda (pdarray) – A pdarray of values that will be raised to a power (pwr)
pwr (integer, float, or pdarray) – The power(s) that pda is raised to
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be raised to the respective power. Elsewhere, it will retain its
original value. Default set to True.
Returns:
pdarray
Returns a pdarray of values raised to a power, under the boolean where condition.
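The where semantics described above (keep the original value where the condition is False, rather than numpy's uninitialized memory) can be sketched with plain Python lists; the helper name is illustrative:

```python
def power_where(values, pwr, where=None):
    """Raise values to pwr only where the condition is True; elsewhere keep
    the original value, matching the deviation from numpy described above.
    Illustrative sketch, not arkouda's implementation."""
    if where is None:
        return [v ** pwr for v in values]
    return [v ** pwr if w else v for v, w in zip(values, where)]

# Positions with where=False retain their original values
result = power_where([1, 2, 3, 4], 2, [True, False, True, False])
```

With where omitted (the default True), every element is raised to the power.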
f_exp (pdarray, default = None) – The expected frequency.
ddof (int) – The delta degrees of freedom.
lambda (string, default = "pearson") –
The power in the Cressie-Read power divergence statistic.
Allowed values: “pearson”, “log-likelihood”, “freeman-tukey”, “mod-log-likelihood”,
“neyman”, “cressie-read”
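The named lambda values above map onto the standard Cressie-Read power divergence formula. The following is a pure-Python sketch (not arkouda's implementation), using the textbook definition and its log-based limits for lambda = 0 and lambda = -1:

```python
import math

def power_divergence(f_obs, f_exp, lambda_):
    """Cressie-Read power divergence statistic for the named lambda values.
    Sketch of the standard formula; lambda=1 reduces to Pearson's chi-squared
    when observed and expected frequencies sum to the same total."""
    lambdas = {"pearson": 1.0, "log-likelihood": 0.0, "freeman-tukey": -0.5,
               "mod-log-likelihood": -1.0, "neyman": -2.0, "cressie-read": 2 / 3}
    lam = lambdas[lambda_]
    if lam == 0.0:   # log-likelihood (G-test) limit of the formula
        return 2.0 * sum(o * math.log(o / e) for o, e in zip(f_obs, f_exp))
    if lam == -1.0:  # modified log-likelihood limit
        return 2.0 * sum(e * math.log(e / o) for o, e in zip(f_obs, f_exp))
    return (2.0 / (lam * (lam + 1.0))) * sum(
        o * ((o / e) ** lam - 1.0) for o, e in zip(f_obs, f_exp))

# With matching totals, "pearson" gives sum((o-e)^2 / e) = 10.0 here
stat = power_divergence([10, 20, 30], [20, 20, 20], "pearson")
```

The default f_exp of None in the documented API means a uniform expectation; this sketch takes explicit frequencies instead.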
Prints verbose information for each object in names in a human readable format
Parameters:
names (Union[List[str], str]) – names is either the name of an object or list of names of objects to retrieve info
if names is ak.AllSymbols, retrieves info for all symbols in the symbol table
if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry
Return type:
None
Raises:
RuntimeError – Raised if a server-side error is thrown in the process of
retrieving information about the objects in names
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be converted from radians to degrees. Elsewhere, it will retain its
original value. Default set to True.
Returns:
A pdarray containing an angle converted to degrees, from radians, for each element
of the original pdarray
Create a random sparse matrix with the specified number of rows and columns
and the specified density. The density is the fraction of non-zero elements
in the matrix. The non-zero elements are uniformly distributed random
numbers in the range [0,1).
Parameters:
size (int) – The number of rows in the matrix; currently, the number of columns equals the number of rows
density (float) – The fraction of non-zero elements in the matrix
dtype (Union[DTypes, str]) – The dtype of the elements in the matrix
Returns:
A sparse matrix with the specified number of rows and columns
and the specified density
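A client-side sketch of the behavior described above, generating COO-style coordinates whose fraction of non-zero entries matches the requested density; this is illustrative, not arkouda's server code:

```python
import random

def random_sparse_coo(size: int, density: float, seed: int = 0):
    """Return (rows, cols, vals) for a size x size sparse matrix where the
    fraction of non-zero entries is ~density and values are uniform in [0, 1).
    Illustrative sketch of the documented behavior."""
    rng = random.Random(seed)
    nnz = int(density * size * size)
    cells = rng.sample(range(size * size), nnz)  # distinct flat positions
    rows = [c // size for c in cells]
    cols = [c % size for c in cells]
    vals = [rng.random() for _ in range(nnz)]
    return rows, cols, vals

# 10x10 matrix at density 0.2 -> 20 non-zero entries
rows, cols, vals = random_sparse_coo(10, 0.2)
```

Sampling distinct flat positions guarantees no (row, col) pair repeats, so the realized density is exact rather than approximate.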
The lengths of the generated strings are distributed Lognormal(μ, σ²),
with μ = logmean and σ = logstd. Thus, the strings will
have an average length of exp(μ + 0.5σ²), a minimum length of
zero, and a heavy tail towards longer strings.
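The mean-length formula above follows directly from the lognormal distribution's mean; a small worked check (the helper name is illustrative):

```python
import math

def expected_lognormal_length(logmean: float, logstd: float) -> float:
    """Mean of Lognormal(mu, sigma^2): exp(mu + sigma^2 / 2),
    the average string length stated in the documentation."""
    return math.exp(logmean + 0.5 * logstd ** 2)

# With mu = 2 and sigma = 0.5, the average length is exp(2.125)
avg = expected_lognormal_length(2.0, 0.5)
```

The heavy right tail means occasional strings much longer than this average, while the minimum length stays at zero.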
Read datasets from files.
File Type is determined automatically.
Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the
same precision and sign. If False, allow dtypes of different
precision and sign across different files. For example, if one
file contains a uint32 dataset and another contains an int64
dataset with the same name, the contents of both will be read
into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the
offsets/segments array on the server versus loading them from HDF5 files.
In the future this option may be set to True as the default.
column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
Parquet files only; ignored if datasets is not None.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns
that contain null values.
fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the
length of each string is known at runtime. This can allow for skipping byte
calculation, which can have an impact on performance.
Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
If filenames is a string, it is interpreted as a shell expression
(a single filename is a valid expression, so it will work) and is
expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to
the server as independent sequential strings while if iterative == False
all dataset names and file names are passed to the server in a single
string.
If datasets is None, infer the names of datasets from the first file
and read all of them. Use get_datasets to show the names of datasets
in HDF5/Parquet files.
CSV files without the Arkouda Header are not supported.
Examples
Read with file extension
>>> x = ak.read('path/name_prefix.h5')  # load HDF5; the file contents, not the extension, determine the format
Read without file extension
>>> x = ak.read('path/name_prefix.parquet')  # load Parquet
Read glob expression
>>> x = ak.read('path/name_prefix*')  # reads HDF5
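The shell-expression expansion from the Notes can be demonstrated with Python's glob module; this stand-alone sketch only shows the matching behavior, not arkouda's read path:

```python
import glob
import os
import tempfile

# Create a few per-locale files, then expand a prefix pattern with glob,
# mirroring how a filename string is treated as a shell expression.
with tempfile.TemporaryDirectory() as d:
    for i in range(3):
        open(os.path.join(d, f"name_prefix_LOCALE{i:04d}"), "w").close()
    matched = sorted(glob.glob(os.path.join(d, "name_prefix*")))

names = [os.path.basename(p) for p in matched]
```

A single filename is itself a valid expression (it matches exactly one file), which is why passing a plain path also works.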
Read CSV file(s) into Arkouda objects. If more than one dataset is found, the objects
will be returned in a dictionary mapping the dataset name to the Arkouda object
containing the data. If the file contains the appropriately formatted header, typed
data will be returned. Otherwise, all data will be returned as a Strings object.
Parameters:
filenames (str or List[str]) – The filenames to read data from
datasets (str or List[str] (Optional)) – names of the datasets to read. When None, all datasets will be read.
column_delim (str) – The delimiter for column names and data. Defaults to “,”.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
filenames (str, List[str]) – Filename/s to read objects from
datasets (Optional str, List[str]) – datasets to read from the provided files
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strict_types (bool) – If True (default), require all dtypes of a given dataset to have the
same precision and sign. If False, allow dtypes of different
precision and sign across different files. For example, if one
file contains a uint32 dataset and another contains an int64
dataset with the same name, the contents of both will be read
into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the
offsets/segments array on the server versus loading them from HDF5 files.
In the future this option may be set to True as the default.
tagData (bool) – Default False, if True tag the data with the code associated with the filename
that the data was pulled from.
Returns:
Dictionary of {datasetName: pdarray, String, SegArray}
Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
Raises:
ValueError – Raised if all datasets are not present in all hdf5 files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
If filenames is a string, it is interpreted as a shell expression
(a single filename is a valid expression, so it will work) and is
expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to
the server as independent sequential strings while if iterative == False
all dataset names and file names are passed to the server in a single
string.
If datasets is None, infer the names of datasets from the first file
and read all of them. Use get_datasets to show the names of datasets
in HDF5 files.
filenames (str, List[str]) – Filename/s to read objects from
datasets (Optional str, List[str]) – datasets to read from the provided files
iterative (bool) – Iterative (True) or Single (False) function call(s) to server
strict_types (bool) – If True (default), require all dtypes of a given dataset to have the
same precision and sign. If False, allow dtypes of different
precision and sign across different files. For example, if one
file contains a uint32 dataset and another contains an int64
dataset with the same name, the contents of both will be read
into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
tagData (bool) – Default False, if True tag the data with the code associated with the filename
that the data was pulled from.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
If datasets is not None, this will be ignored.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns
that contain null values.
fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the
length of each string is known at runtime. This can allow for skipping byte
calculation, which can have an impact on performance.
Returns:
Dictionary of {datasetName: pdarray, String, or SegArray}
Return type:
Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.
Raises:
ValueError – Raised if all datasets are not present in all parquet files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
Notes
If filenames is a string, it is interpreted as a shell expression
(a single filename is a valid expression, so it will work) and is
expanded with glob to read all matching files.
If iterative == True each dataset name and file names are passed to
the server as independent sequential strings while if iterative == False
all dataset names and file names are passed to the server in a single
string.
If datasets is None, infer the names of datasets from the first file
and read all of them. Use get_datasets to show the names of datasets
in Parquet files.
Parquet always recomputes offsets at this time.
This will need to be updated once the Parquet workflow is updated.
Read datasets from files and tag each record to the file it was read from.
File Type is determined automatically.
Parameters:
filenames (list or str) – Either a list of filenames or shell expression
datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)
strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the
same precision and sign. If False, allow dtypes of different
precision and sign across different files. For example, if one
file contains a uint32 dataset and another contains an int64
dataset with the same name, the contents of both will be read
into an int64 pdarray.
allow_errors (bool) – Default False, if True will allow files with read errors to be skipped
instead of failing. A warning will be included in the return containing
the total number of files skipped due to failure and up to 10 filenames.
calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the
offsets/segments array on the server versus loading them from HDF5 files.
In the future this option may be set to True as the default.
read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False,
SegArray (or other nested Parquet columns) will be ignored.
Ignored if datasets is not None
Parquet Files only.
has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns
that contain null values.
Notes
Not currently supported for Categorical or GroupBy datasets
Examples
Read files and return data with tagging corresponding to the Categorical returned
cat.codes will link the codes in data to the filename. Data will contain the code Filename_Codes
>>> data, cat = ak.read_tagged_data('path/name')
>>> data
{'Filename_Codes': array([0 3 6 9 12]), 'col_name': array([0 0 0 1])}
Reads a Zarr store from disk into a pdarray. Supports multi-dimensional pdarrays of numeric types.
To use this function, ensure you have installed the blosc dependency (make install-blosc)
and have included ZarrMsg.chpl in the ServerModules.cfg file.
Parameters:
store_path (str) – The path to the Zarr store. The path must be to a directory that contains a .zarray
file containing the Zarr store metadata.
ndim (int) – The number of dimensions in the array
hostname (str) – The hostname of the server that sent the array
port (int_scalars) – The port to send the array over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open numLocales ports in
succession starting at port, i.e., ports in the
range {port..(port+numLocales-1)} (e.g., running an
Arkouda server of 4 nodes, if port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
pdarray.transfer().
Returns:
The pdarray sent from the sending server to the current
receiving server.
hostname (str) – The hostname of the server that sent the dataframe
port (int_scalars) – The port to send the dataframe over. This needs to be an
open port (i.e., not one that the Arkouda server is
running on). This will open numLocales ports in
succession starting at port, i.e., ports in the
range {port..(port+numLocales-1)} (e.g., running an
Arkouda server of 4 nodes, if port 1234 is passed as
port, Arkouda will use ports 1234, 1235, 1236,
and 1237 to send the array data).
This port must match the port passed to the call to
pdarray.send_array().
Returns:
The dataframe sent from the sending server to the
current receiving server.
Unlike other save/load methods using snapshot restore will save DataFrames alongside other
objects in HDF5. Thus, they are returned within the dictionary as a dataframe.
DEPRECATED
Save multiple named pdarrays to HDF5/Parquet files.
:param columns: Collection of arrays to save
:type columns: dict or list of pdarrays
:param prefix_path: Directory and filename prefix for output files
:type prefix_path: str
:param names: Dataset names for the pdarrays
:type names: list of str
:param file_format: 'HDF5' or 'Parquet'. Defaults to 'HDF5'
:type file_format: str
:param mode: By default, truncate (overwrite) the output files if they exist.
If ‘append’, attempt to create new dataset in existing files.
Parameters:
file_type (str ("single" | "distribute")) – Default: distribute
Single writes the dataset to a single file
Distribute writes the dataset to a file per locale
Only used with HDF5
compression (str (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4")) – Optional
Select the compression to use with Parquet files.
Only used with Parquet.
Return type:
None
Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode
is not ‘truncate’ or ‘append’
Creates one file per locale containing that locale’s chunk of each pdarray.
If columns is a dictionary, the keys are used as the HDF5 dataset names.
Otherwise, if no names are supplied, 0-up integers are used. By default,
any existing files at path_prefix will be overwritten, unless the user
specifies the ‘append’ mode, in which case arkouda will attempt to add
<columns> as new datasets to existing files. If the wrong number of files
is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with a mapping defining dataset names
>>> ak.save_all({'a': a, 'b': b}, 'path/name_prefix', file_format='Parquet')
>>> # Save using names instead of a mapping
>>> ak.save_all([a, b], 'path/name_prefix', names=['a', 'b'], file_format='Parquet')
D.update([E, ]**F) -> None. Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
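The two update conventions above can be illustrated with a plain Python dict, which follows the same semantics:

```python
# D.update(E, **F): E may be a mapping or an iterable of key/value pairs.
d = {'a': 1}

# E has a .keys() method: for k in E: D[k] = E[k]; F is applied afterwards.
d.update({'b': 2}, c=3)
print(d)  # {'a': 1, 'b': 2, 'c': 3}

# E lacks a .keys() method: for k, v in E: D[k] = v
d.update([('a', 10), ('d', 4)])
print(d)  # {'a': 10, 'b': 2, 'c': 3, 'd': 4}
```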
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the sine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing sin for each element
of the original pdarray
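The where semantics above can be sketched with NumPy's ufunc where/out pattern, which behaves the same way (this is the NumPy analog, not Arkouda API; Arkouda requires a running server):

```python
import numpy as np

x = np.array([0.0, np.pi / 2, 5.0])
cond = np.array([True, True, False])

# Where cond is True, sin is applied; elsewhere the original value is retained.
out = x.copy()
np.sin(x, where=cond, out=out)
print(out)  # [0. 1. 5.] -- the last element keeps its original value 5.0
```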
Return a pair of integers, whose ratio is exactly equal to the original
floating point number, and with a positive denominator.
Raise OverflowError on infinities and a ValueError on NaNs.
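A quick check of this behavior with Python floats:

```python
# Exact numerator/denominator pairs for binary floating-point values
assert (0.25).as_integer_ratio() == (1, 4)
assert (-0.5).as_integer_ratio() == (-1, 2)  # denominator is always positive

# 0.1 is not exactly representable, so the ratio reflects the stored binary value
num, den = (0.1).as_integer_ratio()
print(num, den)  # 3602879701896397 36028797018963968
```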
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the hyperbolic sine will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing hyperbolic sine for each element
of the original pdarray
Computes the sample skewness of an array.
Skewness > 0 means there’s greater weight in the right tail of the distribution.
Skewness < 0 means there’s greater weight in the left tail of the distribution.
Skewness == 0 means the distribution is symmetric about its mean (as with normally distributed data).
Based on the scipy.stats.skew function.
Parameters:
pda (pdarray) – A pdarray of values that will be calculated to find the skew
bias (bool, optional) – If False, then the calculations are corrected for statistical bias.
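A sketch of the sample skewness as scipy.stats.skew computes it (g1 = m3 / m2**1.5 from the central moments), written in NumPy for illustration; this is not the Arkouda implementation:

```python
import numpy as np

def sample_skew(x, bias=True):
    x = np.asarray(x, dtype=float)
    m = x.mean()
    m2 = ((x - m) ** 2).mean()  # second central moment
    m3 = ((x - m) ** 3).mean()  # third central moment
    g1 = m3 / m2 ** 1.5
    if not bias:
        n = x.size
        g1 *= np.sqrt(n * (n - 1)) / (n - 2)  # correction for statistical bias
    return g1

print(sample_skew([1, 2, 3, 4, 5]))  # 0.0 -- symmetric data has zero skew
```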
The class for sparse arrays. This class contains only the
attributes of the array; the data resides on the arkouda
server. When a server operation results in a new array, arkouda
will create a sparray instance that points to the array data on
the server. As such, the user should not initialize sparray
instances directly.
Takes the square root of array. If where is given, the operation will only take place in
the positions where the where condition is True.
Parameters:
pda (pdarray) – A pdarray of values that will be square rooted
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the
corresponding value will be square rooted. Elsewhere, it will retain its original value.
Default set to True.
Returns:
pdarray
Returns a pdarray of square rooted values, under the boolean where condition.
The standard deviation is the square root of the average of the squared
deviations from the mean, i.e., std=sqrt(mean((x-x.mean())**2)).
The average squared deviation is normally calculated as
x.sum()/N, where N=len(x). If, however, ddof is specified,
the divisor N-ddof is used instead. In standard statistical
practice, ddof=1 provides an unbiased estimator of the variance
of the infinite population. ddof=0 provides a maximum likelihood
estimate of the variance for normally distributed variables. The
standard deviation computed in this function is the square root of
the estimated variance, so even with ddof=1, it will not be an
unbiased estimate of the standard deviation per se.
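The ddof behavior described above can be checked with NumPy, which follows the same convention:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
# Squared deviations from the mean (2.5) sum to 5.0
print(np.std(x, ddof=0))  # sqrt(5/4) ~= 1.1180 (maximum-likelihood estimate)
print(np.std(x, ddof=1))  # sqrt(5/3) ~= 1.2910 (square root of unbiased variance)
```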
This represents a generic version of type ‘origin’ with type arguments ‘params’.
There are two kinds of these aliases: user defined and special. The special ones
are wrappers around builtin collections and ABCs in collections.abc. These must
have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated,
this is used by e.g. typing.List and typing.Dict.
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the tangent will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing tangent for each element
of the original pdarray
where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True,
the hyperbolic tangent will be applied to the corresponding value. Elsewhere, it will retain
its original value. Default set to True.
Returns:
A pdarray containing hyperbolic tangent for each element
of the original pdarray
Return a fixed frequency TimedeltaIndex, with day as the default
frequency. Alias for ak.Timedelta(pd.timedelta_range(args)).
Subject to size limit imposed by client.maxTransferBytes.
Parameters:
start (str or timedelta-like, default None) – Left bound for generating timedeltas.
end (str or timedelta-like, default None) – Right bound for generating timedeltas.
periods (int, default None) – Number of periods to generate.
freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’.
name (str, default None) – Name of the resulting TimedeltaIndex.
closed (str, default None) – Make the interval closed with respect to the given frequency to
the ‘left’, ‘right’, or both sides (None).
Returns:
rng
Return type:
TimedeltaIndex
Notes
Of the four parameters start, end, periods, and freq,
exactly three must be specified. If freq is omitted, the resulting
TimedeltaIndex will have periods linearly spaced elements between
start and end (closed on both sides).
To learn more about the frequency strings, please see the pandas offset-alias documentation.
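Since this is an alias around pd.timedelta_range, the three-of-four-parameters rule can be seen directly with pandas:

```python
import pandas as pd

# start + periods + freq: three of the four parameters specified
rng = pd.timedelta_range(start="1 day", periods=4, freq="D")
print(list(rng))  # [Timedelta('1 days'), ..., Timedelta('4 days')]

# start + end + periods: freq omitted, elements are linearly spaced
rng2 = pd.timedelta_range(start="1 day", end="2 days", periods=3)
print(rng2[1])  # 1 days 12:00:00
```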
Write Arkouda object(s) to CSV file(s). All CSV Files written by Arkouda
include a header denoting data types of the columns.
Parameters:
columns (Mapping[str, pdarray] or List[pdarray]) – The objects to be written to CSV file. If a mapping is used and names is None
the keys of the mapping will be used as the dataset names.
prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended
when they are written to disk.
names (List[str] (Optional)) – names of dataset to be written. Order should correspond to the order of data
provided in columns.
col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file.
Please be sure that the value used DOES NOT appear in your dataset.
overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will
be overwritten. If False, an error will be returned if existing files are found.
Return type:
None
Raises:
ValueError – Raised if all datasets are not present in all csv files or if one or
more of the specified files do not exist
RuntimeError – Raised if one or more of the specified files cannot be opened.
If allow_errors is true this may be raised if no values are returned
from the server.
TypeError – Raised if we receive an unknown arkouda_type returned from the server
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist.
If ‘append’, attempt to create new dataset in existing files.
file_type (str ("single" | "distribute")) – Default: distribute
Single writes the dataset to a single file
Distribute writes the dataset to a file per locale
Return type:
None
Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode
is not ‘truncate’ or ‘append’
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Creates one file per locale containing that locale’s chunk of each pdarray.
If columns is a dictionary, the keys are used as the HDF5 dataset names.
Otherwise, if no names are supplied, 0-up integers are used. By default,
any existing files at path_prefix will be overwritten, unless the user
specifies the ‘append’ mode, in which case arkouda will attempt to add
<columns> as new datasets to existing files. If the wrong number of files
is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with a mapping defining dataset names
>>> ak.to_hdf({'a': a, 'b': b}, 'path/name_prefix')
>>> # Save using names instead of a mapping
>>> ak.to_hdf([a, b], 'path/name_prefix', names=['a', 'b'])
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist.
If ‘append’, attempt to create new dataset in existing files.
‘append’ is deprecated, please use the multi-column write
compression (str (Optional)) –
Default None
Provide the compression type to use when writing the file.
Supported values: snappy, gzip, brotli, zstd, lz4
convert_categoricals (bool) – Defaults to False.
Parquet requires all columns to be the same size, and Categoricals
don’t satisfy that requirement.
If set, write the equivalent Strings in place of any Categorical columns.
Return type:
None
Raises:
ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode
is not ‘truncate’ or ‘append’
RuntimeError – Raised if a server-side error is thrown saving the pdarray
Creates one file per locale containing that locale’s chunk of each pdarray.
If columns is a dictionary, the keys are used as the Parquet column names.
Otherwise, if no names are supplied, 0-up integers are used. By default,
any existing files at path_prefix will be overwritten, unless the user
specifies the ‘append’ mode, in which case arkouda will attempt to add
<columns> as new datasets to existing files. If the wrong number of files
is present or dataset names already exist, a RuntimeError is raised.
Examples
>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with a mapping defining dataset names
>>> ak.to_parquet({'a': a, 'b': b}, 'path/name_prefix')
>>> # Save using names instead of a mapping
>>> ak.to_parquet([a, b], 'path/name_prefix', names=['a', 'b'])
Writes a pdarray to disk as a Zarr store. Supports multi-dimensional pdarrays of numeric types.
To use this function, ensure you have installed the blosc dependency (make install-blosc)
and have included ZarrMsg.chpl in the ServerModules.cfg file.
Parameters:
store_path (str) – The path at which Zarr store should be written
chunk_shape (tuple) – The shape of the chunks to be used in the Zarr store
Raises:
ValueError – Raised if the number of dimensions in the chunk shape does not match
the number of dimensions in the array or if the array is not a 32 or 64 bit numeric type
diag (int_scalars) – if diag = 0, zeros start just below the main diagonal
if diag = 1, zeros start at the main diagonal
if diag = 2, zeros start just above the main diagonal
etc.
diag (int_scalars) – if diag = 0, zeros start just above the main diagonal
if diag = 1, zeros start at the main diagonal
if diag = 2, zeros start just below the main diagonal
etc.
Returns the unique elements of an array, sorted if the values are integers.
There is an optional output in addition to the unique elements: the number
of times each unique value comes up in the input array.
return_groups (bool, optional) – If True, also return grouping information for the array.
assume_sorted (bool, optional) – If True, assume pda is sorted and skip sorting step
return_indices (bool, optional) – Only applicable if return_groups is True.
If True, return unique key indices along with other groups
Returns:
unique ((list of) pdarray, Strings, or Categorical) – The unique values. If input dtype is int64, return values will be sorted.
permutation (pdarray, optional) – Permutation that groups equivalent values together (only when return_groups=True)
segments (pdarray, optional) – The offset of each group in the permuted array (only when return_groups=True)
Raises:
TypeError – Raised if pda is not a pdarray or Strings object
RuntimeError – Raised if the pdarray or Strings dtype is unsupported
Notes
For integer arrays, this function checks to see whether pda is sorted
and, if so, whether it is already unique. This step can save considerable
computation. Otherwise, this function will sort pda.
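The grouping information returned with return_groups=True can be sketched with NumPy (an illustrative analog, not the Arkouda implementation):

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1, 3])

perm = np.argsort(a, kind="stable")  # permutation grouping equal values together
grouped = a[perm]                    # [1, 1, 2, 3, 3, 3]
uniq, segments = np.unique(grouped, return_index=True)

print(uniq)      # [1 2 3]
print(segments)  # [0 2 3] -- offset of each group in the permuted array
```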
Registered names/pdarrays in the server are immune to deletion until
they are unregistered.
Examples
>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.attach_pdarray("my_zeros")
>>> # ...other work...
>>> ak.unregister_pdarray_by_name(b)
Overwrite the datasets with name appearing in names or keys in columns if columns
is a dictionary
Parameters:
columns (dict or list of pdarrays) – Collection of arrays to save
prefix_path (str) – Directory and filename prefix for output files
names (list of str) – Dataset names for the pdarrays
repack (bool) – Default: True
HDF5 does not release memory on delete. When True, the inaccessible
data (that was overwritten) is removed. When False, the data remains, but is
inaccessible. Setting to false will yield better performance, but will cause
file sizes to expand.
Raises:
RuntimeError – Raised if a server-side error is thrown saving the datasets
Notes
If the file does not contain a File_Format attribute to indicate how it was saved,
the file name is checked for _LOCALE#### to determine if it is distributed.
If the datasets provided do not exist, they will be added
Because HDF5 deletes do not release memory, this will create a copy of the
file with the new data
This workflow is slightly different from to_hdf to prevent reading and
creating a copy of the file for each dataset
This function differs from histogram() in that it only returns
counts for values that are present, leaving out empty “bins”. This
function delegates all logic to the unique() method where the
return_counts parameter is set to True.
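The behavior mirrors NumPy's unique with return_counts=True; note that only values actually present appear, unlike fixed-width histogram bins:

```python
import numpy as np

# Count occurrences of each distinct value; absent values get no "empty bin"
vals, counts = np.unique([2, 0, 2, 4, 0, 2], return_counts=True)
print(vals)    # [0 2 4]
print(counts)  # [2 3 1]
```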
The variance is the average of the squared deviations from the mean,
i.e., var=mean((x-x.mean())**2).
The mean is normally calculated as x.sum()/N, where N=len(x).
If, however, ddof is specified, the divisor N-ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.
Create a new structured or unstructured void scalar.
length_or_data (int, array-like, bytes-like, object) – One of multiple meanings (see notes). The length or
bytes data of an unstructured void. Or alternatively,
the data to be stored in the new scalar when dtype
is provided.
This can be an array-like, in which case an array may
be returned.
dtype (dtype, optional) – If provided, the dtype of the new scalar. This dtype must
be a “void” dtype (i.e. a structured or unstructured void,
see also defining-structured-types).
.. versionadded:: 1.24
For historical reasons and because void scalars can represent both
arbitrary byte data and structured dtypes, the void constructor
has three calling conventions:
np.void(5) creates a dtype="V5" scalar filled with five
\0 bytes. The 5 can be a Python or NumPy integer.
np.void(b"bytes-like") creates a void scalar from the byte string.
The dtype itemsize will match the byte string length, here "V10".
When a dtype= is passed the call is roughly the same as an
array creation. However, a void scalar rather than array is returned.
Please see the examples which show all three different conventions.
>>> np.void(5)
void(b'\x00\x00\x00\x00\x00')
>>> np.void(b'abcd')
void(b'\x61\x62\x63\x64')
>>> np.void((5, 3.2, "eggs"), dtype="i,d,S5")
(5, 3.2, b'eggs')  # looks like a tuple, but is `np.void`
>>> np.void(3, dtype=[('x', np.int8), ('y', np.int8)])
(3, 3)  # looks like a tuple, but is `np.void`
dtype (Optional[Union[type, str]], optional) – The data-type of the output array. If not provided, the output
array will be determined using np.common_type on the
input arrays Defaults to None
casting ({"no", "equiv", "safe", "same_kind", "unsafe"}, optional) – Controls what kind of data casting may occur - currently unused
Returns an array with elements chosen from A and B based upon a
conditioning array. As is the case with numpy.where, the return array
consists of values from the first array (A) where the conditioning array
elements are True and from the second array (B) where the conditioning
array elements are False.
Parameters:
condition (pdarray) – Used to choose values from A or B
TypeError – Raised if the condition object is not a pdarray, if A or B is not
an int, np.int64, float, np.float64, pdarray, str, Strings, Categorical
if pdarray dtypes are not supported or do not match, or multiple
condition clauses (see Notes section) are applied
ValueError – Raised if the shapes of the condition, A, and B pdarrays are unequal
A and B must have the same dtype. Only one conditional clause
is supported; compound conditions (e.g., n < 5 and n > 1), which
numpy supports, are not currently supported in Arkouda.
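For a single condition, the selection rule matches numpy.where, which can be verified directly:

```python
import numpy as np

cond = np.array([True, False, True, False])
A = np.array([1, 2, 3, 4])
B = np.array([10, 20, 30, 40])

# Values come from A where cond is True and from B where it is False.
result = np.where(cond, A, B)
print(result)  # [ 1 20  3 40]
```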
log_msg (str) – The message to be added to the server log
tag (str) – The tag to use in the log. This takes the place of the server function name.
Allows for easy identification of custom logs.
Defaults to “ClientGeneratedLog”
log_lvl (LogLevel) – The type of log to be written
Defaults to LogLevel.INFO